Preface


Purpose 


As teachers of linear algebra, we wanted to write a book to help students and the general 
public appreciate and understand one of the most exciting applications of linear algebra 
today—the use of link analysis by web search engines. This topic is inherently interesting, 
timely, and familiar. For instance, the book answers such curious questions as: How do 
search engines work? Why is Google so good? What’s a Google bomb? How can I 
improve the ranking of my homepage in Teoma? 


We also wanted this book to be a single source for material on web search engine rank- 
ings. A great deal has been written on this topic, but it’s currently spread across numerous 
technical reports, preprints, conference proceedings, articles, and talks. Here we have 
summarized, clarified, condensed, and categorized the state of the art in web ranking. 


Our Audience 


We wrote this book with two diverse audiences in mind: the general science reader 
and the technical science reader. The title echoes the technical content of the book, but 
in addition to being informative on a technical level, we have also tried to provide some 
entertaining features and lighter material concerning search engines and how they work. 


The Mathematics 


Our goal in writing this book was to reach a challenging audience consisting of the 
general scientific public as well as the technical scientific public. Of course, a complete 
understanding of link analysis requires an acquaintance with many mathematical ideas. 
Nevertheless, we have tried to make the majority of the book accessible to the general sci- 
entific public. For instance, each chapter builds progressively in mathematical knowledge, 
technicality, and prerequisites. As a result, Chapters 1-4, which introduce web search and 
link analysis, are aimed at the general science reader. Chapters 6, 9, and 10 are particularly 
mathematical. The last chapter, Chapter 15, “The Mathematics Guide,” is a condensed but 
complete reference for every mathematical concept used in the earlier chapters. Through- 
out the book, key mathematical concepts are highlighted in shaded boxes. By postponing 
the mathematical definitions and formulas until Chapter 15 (rather than interspersing them 
throughout the text), we were able to create a book that our mathematically sophisticated 
readers will also enjoy. We feel this approach is a compromise that allows us to serve both 
audiences: the general and technical scientific public. 


Asides 


An enjoyable feature of this book is the use of Asides. Asides contain entertaining news 
stories, practical search tips, amusing quotes, and racy lawsuits. Every chapter, even the 
particularly technical ones, contains several asides. Oftentimes a light aside provides the
perfect break after a stretch of serious mathematical thinking. Brief asides appear in shaded 
boxes while longer asides that stretch across multiple pages are offset by horizontal bars 
and italicized font. We hope you enjoy these breaks—we found ourselves looking forward 
to writing them. 


Computing and Code 


Truly mastering a subject requires experimenting with the ideas. Consequently, we have 
incorporated Matlab code to encourage and jump-start the experimentation process. While 
any programming language is appropriate, we chose Matlab for three reasons: (1) its matrix 
storage architecture and built-in commands are particularly suited to the large sparse link 
analysis matrices of this text, (2) among colleges and universities, Matlab is a market leader 
in mathematical software, and (3) it’s very user-friendly. The Matlab programs in this book 
are intended to be instruction, not production, code. We hope that, by playing with these 
programs, readers will be inspired to create new models and algorithms. 
Chapter One


Introduction to Web Search Engines 


1.1 A SHORT HISTORY OF INFORMATION RETRIEVAL 


Today we have museums for everything—the museum of baseball, of baseball players, of 
crazed fans of baseball players, museums for world wars, national battles, legal fights, and 
family feuds. While there’s no shortage of museums, we have yet to find a museum ded- 
icated to this book’s field, a museum of information retrieval and its history. Of course, 
there are related museums, such as the Library Museum in Borås, Sweden, but none concentrating
on information retrieval. Information retrieval¹ is the process of searching
within a document collection for a particular information need (called a query). Although 
dominated by recent events following the invention of the computer, information retrieval 
actually has a long and glorious tradition. To honor that tradition, we propose the cre- 
ation of a museum dedicated to its history. Like all museums, our museum of information 
retrieval contains some very interesting artifacts. Join us for a brief tour. 


The earliest document collections were recorded on the painted walls of caves. A 
cave dweller interested in searching a collection of cave paintings to answer a particular 
information query had to travel by foot, and stand, staring in front of each painting. Un- 
fortunately, it’s hard to collect an artifact without being gruesome, so let’s fast forward a 
bit. 


Before the invention of paper, ancient Romans and Greeks recorded information on 
papyrus rolls. Some papyrus artifacts from ancient Rome had tags attached to the rolls. 
These tags were an ancient form of today’s Post-it Note, and make an excellent addition to 
our museum. A tag contained a short summary of the rolled document, and was attached 
in order to save readers from unnecessarily unraveling a long irrelevant document. These 
abstracts also appeared in oral form. At the start of Greek plays in the fifth century B.C.,
the chorus recited an abstract of the ensuing action. While no actual classification scheme 
has survived from the artifacts of Greek and Roman libraries, we do know that another 
elementary information retrieval tool, the table of contents, first appeared in Greek scrolls 
from the second century B.C. Books were not invented until centuries later, when necessity 
required an alternative writing material. As the story goes, the Library of Pergamum (in 
what is now Turkey) threatened to overtake the celebrated Library of Alexandria as the 
best library in the world, claiming the largest collection of papyrus rolls. As a result, the 
Egyptians ceased the supply of papyrus to Pergamum, so the Pergamenians invented an 
alternative writing material, parchment, which is made from thin layers of animal skin. (In 
fact, the root of the word parchment comes from the word Pergamum.) Unlike papyrus, 


¹The boldface terms that appear throughout the book are also listed and defined in the Glossary, which begins
on page 201.
parchment did not roll easily, so scribes folded several sheets of parchment and sewed them
into books. These books outlasted scrolls and were easier to use. Parchment books soon 
replaced the papyrus rolls. 


The heights of writing, knowledge, and documentation reached in the Greek and Roman
periods contrast sharply with their scarcity during the Dark and Middle Ages. Precious few
documents were produced during this time. Instead, most information was recorded orally. 
Document collections were recorded in the memory of a village’s best storyteller. Oral 
traditions carried in poems, songs, and prayers were passed from one generation to the 
next. One of the most legendary and lengthy tales is Beowulf, an epic about the adventures 
of a sixth-century Scandinavian warrior. The tale is believed to have originated in the 
seventh century and been passed from generation to generation through song. Minstrels 
often took poetic license, altering and adding verses as the centuries passed. An inquisitive 
child wishing to hear stories about the monster Grendel waited patiently while the master 
storyteller searched his memory to find just the right part of the story. Thus, the result of the 
child’s search for information was biased by the wisdom and judgement of the intermediary 
storyteller. Fortunately, the invention of paper, the best writing medium yet, superior to 
even parchment, brought renewed acceleration to the written record of information and 
collections of documents. In fact, Beowulf passed from oral to written form around A.D. 
1000, a date over which scholars still debate. Later, monks, the possessors of treasured 
reading and writing skills, sat in scriptoriums working as scribes from sunrise to sunset. 
The scribes’ works were placed in medieval libraries, which initially were so small that 
they had no need for classification systems. Eventually the collections grew, and it became 
common practice to divide the holdings into three groups: theological works, classical 
authors of antiquity, and contemporary authors on the seven arts. Lists of holdings and 
tables of contents from classical books make nice museum artifacts from the medieval 
period. 


Other document collections sprang up in a variety of fields. This growth accelerated dramatically
with the re-invention of the printing press by Johann Gutenberg in 1450. The
wealthy proudly boasted of their private libraries, and public libraries were instituted in 
America in the 1700s at the prompting of Benjamin Franklin. As library collections grew 
and became publicly accessible, the desire for focused search became more acute. Hierar- 
chical classification systems were used to group documents on like subjects together. The 
first use of a hierarchical organization system is attributed to the Roman author Valerius 
Maximus, who used it in A.D. 30 to organize the topics in his book, Factorum ac dicto- 
rum memorabilium libri IX (Nine Books of Memorable Deeds and Sayings). Despite these 
rudimentary organization systems, word of mouth and the advice of a librarian were the 
best means of obtaining accurate quality information for a search. Of course, document 
collections and their organization expanded beyond the limits of even the best librarian’s 
memory. More orderly ways of maintaining records of a collection’s holdings were de- 
vised. Notable artifacts that belong in our information retrieval museum are a few lists 
of individual library holdings, sorted by title and also author, as well as examples of the 
Dewey decimal system (1876), the card catalog (early 1900s), microfilm (1930s), and the
MARC (MAchine Readable Cataloging) system (1960s). 


These inventions marked progress, yet search was still not completely in the hands of
the information seeker. It took the invention of the digital computer (1940s and 1950s) and 
the subsequent inventions of computerized search systems to move toward that goal. The 
first computerized search systems used special syntax to automatically retrieve book and
article information related to a user’s query. Unfortunately, the cumbersome syntax kept 
search largely in the domain of librarians trained on the systems. An early representative 
of computerized search such as the Cornell SMART system (1960s) [146] deserves a place 
in our museum of information retrieval. 


In 1989 the storage, access, and searching of document collections were revolutionized
by an invention named the World Wide Web by its founder Tim Berners-Lee [79]. Of
course, our museum must include artifacts from this revolution such as a webpage, some 
HTML, and a hyperlink or two. The invention of linked document collections was truly 
original at this time, despite the fact that Vannevar Bush, once Director of the Office of 
Scientific Research and Development, foreshadowed its coming in his famous 1945 essay, 
“As We May Think” [43]. In that essay, he describes the memex, a futuristic machine 
(with shocking similarity to today’s PC and Web) that mirrors the cognitive processes of 
humans by leaving “trails of association” throughout document collections. Four decades 
of progress later, remnants of Bush’s memex formed the skeleton of Berners-Lee’s Web. A 
drawing of the memex (Figure 1.1) by a graphic artist and approved by Bush was included 
in LIFE magazine’s 1945 publishing of Bush’s prophetic article. 


Figure 1.1 Drawing of Vannevar Bush’s memex appearing in LIFE. Original caption read: “Memex 
in the form of a desk would instantly bring files and material on any subject to the operator’s
fingertips. Slanting translucent screens magnify supermicrofilm filed by code numbers.
At left is a mechanism which automatically photographs longhand notes, pictures, and 
letters, then files them in the desk for future reference.” 


The World Wide Web became the ultimate signal of the dominance of the Informa- 
tion Age and the death of the Industrial Age. Yet despite the revolution in information 
storage and access ushered in by the Web, users initiating web searches found themselves 
floundering. They were looking for the proverbial needle in an enormous, ever-growing 
information haystack. In fact, users felt much like the men in Jorge Luis Borges’ 1941 
short story [35], “The Library of Babel”, which describes an imaginary, infinite library. 


When it was proclaimed that the Library contained all books, the first im- 
pression was one of extravagant happiness. All men felt themselves to be 
the masters of an intact and secret treasure. There was no personal or world 
problem whose eloquent solution did not exist in some hexagon. 
... As was natural, this inordinate hope was followed by an excessive depression.
The certitude that some shelf in some hexagon held precious books and
that these precious books were inaccessible seemed almost intolerable. 


Much of the information in the Library of the Web, like that in the fictitious Library 
of Babel, remained inaccessible. In fact, early web search engines did little to ease user 
frustration; search could be conducted by sorting through hierarchies of topics on Yahoo, or 
by sifting through the many (often thousands of) webpages returned by the search engine, 
clicking on pages to personally determine which were most relevant to the query. Some 
users resorted to the earliest search techniques used by ancient queriers—word of mouth 
and expert advice. They learned about valuable websites from friends and linked to sites 
recommended by colleagues who had already put in hours of search effort. 


All this changed in 1998 when link analysis hit the information retrieval scene 
[40, 106]. The most successful search engines began using link analysis, a technique that 
exploited the additional information inherent in the hyperlink structure of the Web, to im- 
prove the quality of search results. Web search improved dramatically, and web searchers 
religiously used and promoted their favorite engines like Google and AltaVista. In fact, in 
2004 many web surfers freely admit their obsession with, dependence on, and addiction 
to today’s search engines. Below we include the comments [117] of a few Google fans 
to convey the joy caused by the increased accessibility of the Library of the Web made 
possible by the link analysis engines. Incidentally, in May 2004 Google held the largest 
share of the search market with 37% of searchers using Google, followed by 27% using 
the Yahoo conglomerate, which includes AltaVista, AlltheWeb, and Overture.²


• “It’s not my homepage, but it might as well be. I use it to ego-surf. I use
it to read the news. Anytime I want to find out anything, I use it.” - Matt
Groening, creator and executive producer, The Simpsons


• “I can’t imagine life without Google News. Thousands of sources from
around the world ensure anyone with an Internet connection can stay informed.
The diversity of viewpoints available is staggering.” - Michael
Powell, chair, Federal Communications Commission


• “Google is my rapid-response research assistant. On the run-up to a
deadline, I may use it to check the spelling of a foreign name, to acquire
an image of a particular piece of military hardware, to find the exact
quote of a public figure, check a stat, translate a phrase, or research the
background of a particular corporation. It’s the Swiss Army knife of
information retrieval.” - Garry Trudeau, cartoonist and creator, Doonesbury


Nearly all major search engines now combine link analysis scores, similar to those 
used by Google, with more traditional information retrieval scores. In this book, we record 
the history of one aspect of web information retrieval. That aspect is the link analysis 
or ranking algorithms underlying several of today’s most popular and successful search 


²These market share statistics were compiled by comScore, a company that counted the number of searches
done by U.S. surfers in May 2004 using the major search engines. See the article at
http://searchenginewatch.com/reports/article.php/2156431.
engines, including Google and Teoma. Incidentally, we’ll add the PageRank link analysis
algorithm [40] used by Google (see Chapters 4-10) and the HITS algorithm [106] used by 
Teoma (see Chapter 11) to our museum of information retrieval. 


1.2 AN OVERVIEW OF TRADITIONAL INFORMATION RETRIEVAL 


To set the stage for the exciting developments in link analysis to come in later chapters, we 
begin our story by distinguishing web information retrieval from traditional informa- 
tion retrieval. Web information retrieval is search within the world’s largest and linked 
document collection, whereas traditional information retrieval is search within smaller, 
more controlled, nonlinked collections. The traditional nonlinked collections existed be- 
fore the birth of the Web and still exist today. Searching within a university library’s col- 
lection of books or within a professor’s reserve of slides for an art history course—these 
are examples of traditional information retrieval. 


These document collections are nonlinked, mostly static, and are organized and cate- 
gorized by specialists such as librarians and journal editors. These documents are stored in 
physical form as books, journals, and artwork as well as electronically on microfiche, CDs, 
and webpages. However, the mechanisms for searching for items in the collections are 
now almost all computerized. These computerized mechanisms are referred to as search 
engines, virtual machines created by software that enables them to sort through virtual 
file folders to find relevant documents. There are three basic computer-aided techniques 
for searching traditional information retrieval collections: Boolean models, vector space 
models, and probabilistic models [14]. These search models, which were developed in 
the 1960s, have had decades to grow, mesh, and morph into new search models. In fact, 
as of June 2000, there were at least 3,500 different search engines (including the newer 
web engines) [37], which means that there are possibly 3,500 different search techniques. 
Nevertheless, since most search engines rely on one or more of the three basic models, we 
describe these in turn. 


1.2.1 Boolean Search Engines 


The Boolean model of information retrieval, one of the earliest and simplest retrieval meth- 
ods, uses the notion of exact matching to match documents to a user query. Its more refined 
descendants are still used by most libraries. The adjective Boolean refers to the use of
Boolean algebra, whereby words are logically combined with the Boolean operators AND, 
OR, and NOT. For example, the Boolean AND of two logical statements x and y means that 
both x AND y must be satisfied, while the Boolean OR of these two statements means that 
at least one of these statements must be satisfied. Any number of logical statements can be 
combined using the three Boolean operators. The Boolean model of information retrieval 
operates by considering which keywords are present or absent in a document. Thus, a doc- 
ument is judged as relevant or irrelevant; there is no concept of a partial match between 
documents and queries. This can lead to poor performance [14]. More advanced fuzzy set 
theoretic techniques try to remedy this black-white Boolean logic by introducing shades of 
gray. For example, a title search for car AND maintenance on a Boolean engine causes 
the virtual machine to return all documents that use both words in the title. A relevant doc- 
ument entitled “Automobile Maintenance” will not be returned. Fuzzy Boolean engines 
use fuzzy logic to categorize this document as somewhat relevant and return it to the user. 
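

The exact-match behavior described above can be sketched in a few lines of Python. This is only an illustrative toy, not the implementation of any real engine: the three titles, the helper names build_index and term, and the choice to represent each keyword by a set of document ids are all our own assumptions. It shows how AND, OR, and NOT reduce to set operations, and why “Automobile Maintenance” is missed by the query car AND maintenance.

```python
# Toy collection of document titles (made up for illustration).
titles = {
    1: "Car Maintenance for Beginners",
    2: "Automobile Maintenance",
    3: "Car Buying Guide",
}


def build_index(docs):
    """Map each lowercase keyword to the set of document ids whose title contains it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


index = build_index(titles)
all_ids = set(titles)


def term(word):
    """Return the ids of all documents containing the keyword (empty set if none)."""
    return index.get(word.lower(), set())


# The Boolean operators map directly onto set operations.
car_and_maintenance = term("car") & term("maintenance")  # AND: both keywords present
car_or_automobile = term("car") | term("automobile")     # OR: at least one present
not_car = all_ids - term("car")                          # NOT: keyword absent

print(sorted(car_and_maintenance))  # [1]
```

Document 2 is relevant to the searcher but is excluded from the AND result because it never uses the word car; a document either matches or it does not.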
The car maintenance query example introduces the main drawbacks of Boolean
search engines; they fall prey to two of the most common information retrieval problems, 
synonymy and polysemy. Synonymy refers to multiple words having the same meaning, 
such as car and automobile. A standard Boolean engine cannot return semantically related 
documents whose keywords were not included in the original query. Polysemy refers to 
words with multiple meanings. For example, when a user types bank as their query, does 
he or she mean a financial center, a slope on a hill, a shot in pool, or a collection of objects 
[24]? The problem of polysemy can cause many documents that are irrelevant to the user’s 
actual intended query meaning to be retrieved. Many Boolean search engines also require 
that the user be familiar with Boolean operators and the engine’s specialized syntax. For 
example, to find information about the phrase iron curtain, many engines require quo- 
tation marks around the phrase, which tell the search engine that the entire phrase should 
be searched as if it were just one keyword. A user who forgets this syntax requirement 
would be surprised to find retrieved documents about interior decorating and mining for 
iron ore. 


Nevertheless, variants of the Boolean model do form the basis for many search en- 
gines. There are several reasons for their prevalence. First, creating and programming a 
Boolean engine is straightforward. Second, queries can be processed quickly; a quick scan 
through the keyword files for the documents can be executed in parallel. Third, Boolean 
models scale well to very large document collections. Accommodating a growing collec- 
tion is easy. The programming remains simple; merely the storage and parallel processing 
capabilities need to grow. References [14, 75, 107] all contain chapters with excellent 
introductions to the Boolean model and its extensions. 


1.2.2 Vector Space Model Search Engines 


Another information retrieval technique uses the vector space model [147], developed by 
Gerard Salton in the early 1960s, to sidestep some of the information retrieval problems 
mentioned above. Vector space models transform textual data into numeric vectors and matrices,
then employ matrix analysis³ techniques to discover key features and connections
in the document collection. Some advanced vector space models address the common text 
analysis problems of synonymy and polysemy. Advanced vector space models, such as LSI 
[64] (Latent Semantic Indexing), can access the hidden semantic structure in a document 
collection. For example, an LSI engine processing the query car will return documents 
whose keywords are related semantically (in meaning), e.g., automobile. This ability to 
reveal hidden semantic meanings makes vector space models, such as LSI, very powerful 
information retrieval tools. 


Two additional advantages of the vector space model are relevance scoring and rel- 
evance feedback. The vector space model allows documents to partially match a query by 
assigning each document a number between 0 and 1, which can be interpreted as the like- 
lihood of relevance to the query. The group of retrieved documents can then be sorted by 
degree of relevancy, a luxury not possible with the simple Boolean model. Thus, vec- 
tor space models return documents in an ordered list, sorted according to a relevance 
score. The first document returned is judged to be most relevant to the user’s query. 


³Mathematical terms are defined in Chapter 15, the Mathematics Chapter, and are italicized throughout.
Some vector space search engines report the relevance score as a relevancy percentage.
For example, a 97% next to a document means that the document is judged as 97% rele- 
vant to the user’s query. (See the Federal Communications Commission’s search engine, 
http://www.fcc.gov/searchtools.html, which is powered by Inktomi, once known
to use the vector space model. Enter a query such as taxes and notice the relevancy score 
reported on the right side.) Relevance feedback, the other advantage of the vector space 
model, is an information retrieval tuning technique that is a natural addition to the vec- 
tor space model. Relevance feedback allows the user to select a subset of the retrieved 
documents that are useful. The query is then resubmitted with this additional relevance 
feedback information, and a revised set of generally more useful documents is retrieved. 
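

To make relevance scoring concrete, here is a minimal Python sketch under stated assumptions. We assume raw term counts as the vector entries and the cosine of the angle between query and document vectors as the similarity measure; the text above only promises a number between 0 and 1, so other weightings and measures are equally possible. The three documents and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
docs = {
    "d1": "car maintenance and car repair",
    "d2": "automobile maintenance",
    "d3": "gardening tips",
}


def to_vector(text):
    """Represent text as a sparse vector of term counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two count vectors; lies in [0, 1] for nonnegative counts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, collection):
    """Score every document against the query and sort by decreasing score."""
    q = to_vector(query)
    scores = {name: cosine(q, to_vector(text)) for name, text in collection.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank("car maintenance", docs)
for name, score in ranking:
    print(f"{name}: {score:.0%}")  # relevance score shown as a percentage
```

Unlike the Boolean model, d2 receives a partial score for sharing the word maintenance with the query. Note that every document is scored at query time, which is the computational expense discussed next.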


A drawback of the vector space model is its computational expense. At query time, 
distance measures (also known as similarity measures) must be computed between each 
document and the query. And advanced models, such as LSI, require an expensive singu- 
lar value decomposition [82, 127] of a large matrix that numerically represents the entire 
document collection. As the collection grows, the expense of this matrix decomposition 
becomes prohibitive. This computational expense also exposes another drawback—vector 
space models do not scale well. Their success is limited to small document collections. 


Understanding Search Engines 


The informative little book by Michael Berry and Murray Browne, Understanding 
Search Engines: Mathematical Modeling and Text Retrieval [23], provides an 
excellent explanation of vector space models, especially LSI, and contains several 
examples and sample code. Our mathematical readers will enjoy this book and its 
application of linear algebra algorithms in the context of traditional information 
retrieval. 


1.2.3 Probabilistic Model Search Engines 


Probabilistic models attempt to estimate the probability that the user will find a particular 
document relevant. Retrieved documents are ranked by their odds of relevance (the ratio 
of the probability that the document is relevant to the query divided by the probability that 
the document is not relevant to the query). The probabilistic model operates recursively:
the underlying algorithm guesses at initial parameters, then iteratively tries
to improve this initial guess to obtain a final ranking of relevancy probabilities.
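

The odds ratio used for ranking is simple to compute once relevance probabilities are in hand. The Python sketch below assumes those probabilities have already been estimated (the three values are invented for illustration) and shows only the final ranking step; the iterative estimation of the probabilities themselves is not attempted here.

```python
def odds_of_relevance(p_relevant):
    """Odds = P(relevant) / P(not relevant), for a probability in [0, 1)."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("probability must lie in [0, 1)")
    return p_relevant / (1.0 - p_relevant)


# Hypothetical relevance probability estimates for three documents.
estimates = {"doc_a": 0.80, "doc_b": 0.50, "doc_c": 0.20}

# Rank documents by decreasing odds of relevance.
ranked = sorted(estimates, key=lambda d: odds_of_relevance(estimates[d]), reverse=True)
print(ranked)
```

A probability of 0.5 corresponds to even odds of 1, and because the odds grow with the probability, ranking by odds gives the same order as ranking by the probability itself.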


Unfortunately, probabilistic models can be very hard to build and program. Their 
complexity grows quickly, deterring many researchers and limiting their scalability. Prob- 
abilistic models also require several unrealistic simplifying assumptions, such as indepen- 
dence between terms as well as documents. Of course, the independence assumption is 
restrictive in most cases. For instance, in this document the most likely word to follow in- 
formation is the word retrieval, but the independence assumption judges each word 
as equally likely to follow the word information. On the other hand, the probabilistic 
framework can naturally accommodate a priori preferences, and thus, these models do of- 
fer promise of tailoring search results to the preferences of individual users. For example, a 
user’s query history can be incorporated into the probabilistic model’s initial guess, which
generates better query results than a democratic guess. 


1.2.4 Meta-search Engines 


There’s actually a fourth model for traditional search engines, meta-search engines, which
combine the three classic models. Meta-search engines are based on the principle that
while one search engine is good, two (or more) are better. One search engine may be great 
at a certain task, while a second search engine is better at another task. Thus, meta-search 
engines such as Copernic (www.copernic.com) and SurfWax (www.surfwax.com) were
created to simultaneously exploit the best features of many individual search engines. 
Meta-search engines send the query to several search engines at once and return the re- 
sults from all of the search engines in one long unified list. Some meta-search engines 
also include subject-specific search engines, which can be helpful when searching within 
one particular discipline. For example, Monster (www.monster.com) is an employment 
search engine. 
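

A rough Python sketch of the merging step follows. Everything in it is an assumption made for illustration: engine_a and engine_b are stand-ins that return fixed lists rather than calling real search engines, the engines are queried one after another rather than simultaneously, and the round-robin interleaving with duplicate removal is just one plausible way to build a unified list.

```python
def engine_a(query):
    # Stand-in for a real search engine; returns a ranked list of (made-up) URLs.
    return ["a.example/1", "shared.example", "a.example/2"]


def engine_b(query):
    # A second stand-in engine with its own ranking.
    return ["shared.example", "b.example/1"]


def meta_search(query, engines):
    """Send the query to every engine and interleave the results, dropping duplicates."""
    result_lists = [engine(query) for engine in engines]
    merged, seen = [], set()
    longest = max((len(results) for results in result_lists), default=0)
    for position in range(longest):
        # Take the result at this rank from each engine in turn.
        for results in result_lists:
            if position < len(results) and results[position] not in seen:
                seen.add(results[position])
                merged.append(results[position])
    return merged


unified = meta_search("iron curtain", [engine_a, engine_b])
print(unified)
```

Interleaving by rank keeps each engine's top results near the top of the unified list; a production meta-search engine would likely also weigh how much it trusts each source.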


1.2.5 Comparing Search Engines 


Annual information retrieval conferences, such as TREC [3], SIGIR, CIR [22] (for tradi- 
tional information retrieval), and WWW [4] (for web information retrieval), are used to 
compare the various information retrieval models underlying search engines and help the 
field progress toward better, more efficient search engines. The two most common rat- 
ings used to differentiate the various search techniques are precision and recall. Precision 
is the ratio of the number of relevant documents retrieved to the total number of docu- 
ments retrieved. Recall is the ratio of the number of relevant documents retrieved to the 
total number of relevant documents in the collection. The higher the precision and recall, 
the better the search engine is. Of course, search engines are tested on document collec- 
tions with known parameters. For example, the commonly used test collection Medlars 
[6], containing 5,831 keywords and 1,033 documents, has been examined so often that 
its properties are well known. For instance, there are exactly 24 documents relevant to 
the phrase neoplasm immunology. Thus, the denominator of the recall ratio for a user 
query on neoplasm immunology is 24. If only 10 of these relevant documents were retrieved by a search
engine for this query, then a recall of 10/24 ≈ .417 is reported. Recall and precision
are information retrieval-specific performance measures, but, of course, when evaluating 
any computer system, time and space are always performance issues. All else held con- 
stant, quick, memory-efficient search engines are preferred to slower, memory-inefficient 
engines. A search engine with fabulous recall and precision is useless if it requires 30 
minutes to perform one query or stores the data on 75 supercomputers. Some other perfor- 
mance measures take a user-centered viewpoint and are aimed at assessing user satisfaction 
and frustration with the information system. A book by Robert Korfhage, Information Stor- 
age and Retrieval [107], discusses these and several other measures for comparing search 
engines. Excellent texts for information retrieval are [14, 75, 163]. 
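

The two ratios defined above translate directly into code. The Python sketch below reuses the count of 24 relevant documents for the neoplasm immunology query, but the document ids and the extra assumption that the engine also returned 5 irrelevant documents are invented for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents in the collection that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# 24 documents are relevant to the query (ids 1..24 are hypothetical labels).
relevant_docs = set(range(1, 25))

# Suppose the engine returns 10 relevant documents plus 5 irrelevant ones.
retrieved_docs = list(range(1, 11)) + [101, 102, 103, 104, 105]

print(f"precision = {precision(retrieved_docs, relevant_docs):.3f}")
print(f"recall    = {recall(retrieved_docs, relevant_docs):.3f}")
```

Here recall is 10/24 and precision is 10/15. The two measures pull against each other: returning the whole collection gives perfect recall but poor precision.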


1.3 WEB INFORMATION RETRIEVAL


1.3.1 The Challenges of Web Search


Tim Berners-Lee and his World Wide Web entered the information retrieval world in 1989 
[79]. This event caused a branch that focused specifically on search within this new docu- 
ment collection to break away from traditional information retrieval. This branch is called 
web information retrieval. Many web search engines are built on the techniques of tradi- 
tional search engines, but they differ in many important ways. We list the properties that 
make the Web such a unique document collection. The Web is: 


• huge,
• dynamic,
• self-organized, and
• hyperlinked.


The Web is indeed huge! In fact, it’s so big that it’s hard to get an accurate count of 
its size. By January 2004, it was estimated that the Web contained over 10 billion pages, 
with an average page size of 500KB [5]. With a world population of about 6.4 billion,
that’s almost 2 pages for each inhabitant. The early exponential growth of the Web has 
slowed recently, but it is still the largest document collection in existence. The Berkeley 
information project, “How Much Information,” estimates that the amount of information 
on the Web is about 20 times the size of the entire Library of Congress print collection [5]. 
Bigger still, a company called BrightPlanet sells access to the so-called Deep Web, which 
they estimate to contain over 92,000TB of data spread over 550 billion pages [1]. Bright- 
Planet defines the Deep Web as the hundreds of thousands of publicly accessible databases 
that create a collection over 500 times larger than the Surface Web. Deep webpages cannot
be found by casual, routine surfing. Surfers must request information from a particular
database, at which point, the relevant pages are served to the user dynamically within a 
matter of seconds. As a result, search engines cannot easily find these dynamic pages since 
they do not exist before or after the query. However, Yahoo appears to be the first search 
engine aiming to index parts of the Deep Web. 


The Web is dynamic! Contrast this with traditional document collections which 
can be considered static in two senses. First, once a document is added to a traditional 
collection, it does not change. The books sitting on a bookshelf are well behaved. They 
don’t change their content by themselves, but webpages do, very frequently. A study by 
Junghoo Cho and Hector Garcia-Molina [52] in 2000 reported that 40% of all webpages in 
their dataset changed within a week, and 23% of the .com pages changed daily. In a much 
more extensive and recent study, the results of Fetterly et al. [74] concur. About 35% of 
all webpages changed over the course of their study, and also pages that were larger in size 
changed more often and more extensively than their smaller counterparts. Second, for the 
most part, the size of a traditional document collection is relatively static. It is true that 
abstracts are added to MEDLINE each year, but how many? Hundreds, maybe thousands. 
These are minuscule additions by Web proportions. Billions of pages are added to the Web 
each year. The dynamics of the Web make it tough to compute relevancy scores for queries 
when the collection is a moving, evolving target.


The Web is self-organized! Traditional document collections are usually collected
and categorized by trained (and often highly paid) specialists. However, on the Web, any- 
one can post a webpage and link away at will. There are no standards and no gatekeepers 
policing content, structure, and format. The data are volatile; there are rapid updates, bro- 
ken links, and file disappearances. One 2002 U.S. study reporting on “link rot” suggested 
that up to 50% of URLs cited in articles in two information technology journals were
inaccessible within four years [1]. The data are heterogeneous, existing in multiple formats,
languages, and alphabets. And often these volatile, heterogeneous data are posted multiple
times. In addition, there is no editorial review process, which means errors, falsehoods, 
and invalid statements abound. Further, this self-organization opens the door for sneaky 
spammers who capitalize on the mercantile potential offered by the Web. The name spammer
was originally given to those who send mass advertising emails. With one click of
the send button, spammers can send their advertising message to thousands of potential 
customers in a matter of seconds. With web search and online retailing, this name was 
broadened to include those using deceptive webpage creation techniques to rank highly in 
web search listings for particular queries. Spammers resorted to using minuscule text font, 
hidden text (white on a white background), and misleading metatag descriptions to fool 
early web search engines (like those using the Boolean technique of traditional informa- 
tion retrieval). The self-organization of the Web also means that webpages are created for 
a variety of different purposes. Some pages are aimed at surfers who are shopping, others 
at surfers who are researching. In fact, search engines must be able to answer many types 
of queries, such as transactional queries, navigational queries, and informational queries. 
All these features of the Web combine to make the job for web search engines Herculean. 


Ah, but the Web is hyperlinked! This linking feature, the foundation of Vannevar 
Bush’s memex, is the saving grace for web search engines. Hyperlinks make the new 
national pastime of surfing possible. But much more importantly, they make focused, ef- 
fective searching a reality. This book is about ways that web search engines exploit the 
additional information available in the Web’s sprawling link structure to improve the qual- 
ity of their search results. Consequently, we focus on just one aspect of the web information 
retrieval process, but one we believe is the most exciting and important. However, the ad- 
vantages resulting from the link structure of the Web did not come without negative side 
effects. The most interesting side effects concern those sneaky spammers. Spammers soon 
caught wind of the link analysis employed by major search engines, and immediately set 
to work on link spamming. Link spammers carefully craft hyperlinking strategies in the 
hope of increasing traffic to their pages. This has created an entertaining game of cat and 
mouse between the search engines and the spammers, which many, the authors included, 
enjoy spectating. See the asides on pages 43 and 52. 


An additional information retrieval challenge for any document collection, but espe- 
cially pertinent to the Web, concerns precision. Although the amount of accessible infor- 
mation continues to grow, a user’s ability to look at documents does not. Users rarely look 
beyond the first 10 or 20 documents retrieved [94]. This user impatience means that search 
engine precision must increase just as rapidly as the number of documents is increasing. 
Another dilemma unique to web search engines concerns their performance measurements 
and comparison. While traditional search engines are compared by running tests on famil- 
iar, well studied, controlled collections, this is not realistic for web engines. Even small 
web collections are too large for researchers to catalog, count, and create estimates of the 
precision and recall numerators and denominators for dozens of queries. Comparing two
search engines is usually done with user satisfaction studies and market share measures in
addition to the baseline comparison measures of speed and storage requirements. 


1.3.2 Elements of the Web Search Process 


This last section of the introductory chapter describes the basic elements of the web in- 
formation retrieval process. Their relationship to one another is shown in Figure 1.2. Our 
purpose in describing the many elements of the search process is twofold: first, it helps 
emphasize the focus of this book, which is the ranking part of the search process, and sec- 
ond, it shows how the ranking process fits into the grand scheme of search. Chapters 3-12 
are devoted to the shaded parts of Figure 1.2, while all other parts are discussed briefly in 
Chapter 2.


Figure 1.2 Elements of a search engine


• Crawler Module. The Web’s self-organization means that, in contrast to traditional
document collections, there is no central collection and categorization organization. 
Traditional document collections live in physical warehouses, such as the college’s 
library or the local art museum, where they are categorized and filed. On the other 
hand, the web document collection lives in a cyber warehouse, a virtual entity that 
is not limited by geographical constraints and can grow without limit. However, 
this geographic freedom brings one unfortunate side effect. Search engines must
do the data collection and categorization tasks on their own. As a result, all web
search engines have a crawler module. This module contains the software that col- 
lects and categorizes the web’s documents. The crawling software creates virtual 
robots, called spiders, that constantly scour the Web gathering new information and 
webpages and returning to store them in a central repository. 


• Page Repository. The spiders return with new webpages, which are temporarily
stored as full, complete webpages in the page repository. The new pages remain in 
the repository until they are sent to the indexing module, where their vital informa- 
tion is stripped to create a compressed version of the page. Popular pages that are 
repeatedly used to serve queries are stored here longer, perhaps indefinitely. 


• Indexing Module. The indexing module takes each new uncompressed page and
extracts only the vital descriptors, creating a compressed description of the page that 
is stored in various indexes. The indexing module is like a black box function that 
takes the uncompressed page as input and outputs a “Cliffnotes” version of the page. 
The uncompressed page is then tossed out or, if deemed popular, returned to the page 
repository. 


• Indexes. The indexes hold the valuable compressed information for each webpage.
This book describes three types of indexes. The first is called the content index. 
Here the content, such as keyword, title, and anchor text for each webpage, is stored 
in a compressed form using an inverted file structure. Chapter 2 describes the in- 
verted file in detail. Further valuable information regarding the hyperlink structure 
of pages in the search engine’s index is gleaned during the indexing phase. This 
link information is stored in compressed form in the structure index. The crawler 
module sometimes accesses the structure index to find uncrawled pages. Special- 
purpose indexes are the final type of index. For example, indexes such as the image 
index and pdf index hold information that is useful for particular query tasks. 


The four modules above (crawler, page repository, indexing module, indexes) and their
corresponding data files exist and operate independent of users and their queries. Spiders
are constantly crawling the Web, bringing back new and updated pages to be indexed and 
stored. In Figure 1.2 these modules are circled and labeled as query-independent. Unlike 
the preceding modules, the query module is query-dependent and is initiated when a user 
enters a query, to which the search engine must respond in real-time. 


• Query Module. The query module converts a user’s natural language query into
a language that the search system can understand (usually numbers), and consults 
the various indexes in order to answer the query. For example, the query module 
consults the content index and its inverted file to find which pages use the query 
terms. These pages are called the relevant pages. Then the query module passes the 
set of relevant pages to the ranking module. 


• Ranking Module. The ranking module takes the set of relevant pages and ranks
them according to some criterion. The outcome is an ordered list of webpages such 
that the pages near the top of the list are most likely to be what the user desires. 
The ranking module is perhaps the most important component of the search process
because the output of the query module often results in too many (thousands
of) relevant pages that the user must sort through. The ordered list filters the less
relevant pages to the bottom, making the list of pages more manageable for the user. 
(In contrast, the similarity measures of traditional information retrieval often do not 
filter out enough irrelevant pages.) Actually, this ranking which carries valuable, 
discriminatory power is arrived at by combining two scores, the content score and 
the popularity score. Many rules are used to give each relevant page a relevancy 
or content score. For example, many web engines give pages using the query word 
in the title or description a higher content score than pages using the query word in 
the body of the page [39]. The popularity score, which is the focus of this book, 
is determined from an analysis of the Web’s hyperlink structure. The content score 
is combined with the popularity score to determine an overall score for each rele- 
vant page [30]. The set of relevant pages resulting from the query module is then 
presented to the user in order of their overall scores. 


Chapter 2 gives an introduction to all components of the web search process, ex- 
cept the ranking component. The ranking component, specifically the popularity score, 
is the subject of this book. Chapters 3 through 12 provide a comprehensive treatment of 
the ranking problem and its suggested solutions. Each chapter progresses in depth and 
mathematical content. 




Chapter Two 


Crawling, Indexing, and Query Processing 


Spiders are the building blocks of search engines. Decisions about the design of the crawler 
and the capabilities of its spiders affect the design of the other modules, such as the index- 
ing and query processing modules. 


So in this chapter, we begin our description of the basic components of a web search 
engine with the crawler and its spiders. We purposely exclude one component, the ranking 
component, since it is the focus of this book and is covered in the remaining chapters. 
The goals and challenges of web crawlers are introduced in section 2.1, and a simple 
program for crawling the Web is provided. Indexing a collection of documents as enormous 
as the Web creates special storage challenges (section 2.2), and also has search engines 
constantly increasing the size of their indexes (see the aside on page 20). The size of 
the Web makes the real-time processing of queries an astounding feat, and section 2.3 
describes the structures and mechanisms that make this possible. 


2.1 CRAWLING 


The crawler module contains a short software program that instructs robots or spiders on 
how and which pages to retrieve. The crawling module gives a spider a root set of URLs 
to visit, instructing it to start there and follow links on those pages to find new pages. 
Every crawling program must address several issues. For example, which pages should the 
spiders crawl? Some search engines focus on specialized search, and as a result, conduct 
specialized crawls, through only .gov pages, or pages with images, or blog files, etc. For
instance, Bernhard Seefeld’s search engine, search.ch, crawls only Swiss webpages and 
stops at the geographical borders of Switzerland. Even the most comprehensive search 
engine indexes only a small portion of the entire Web. Thus, crawlers must carefully select 
which pages to visit. 
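
The root-set idea described above can be sketched as a breadth-first traversal. The sketch below runs on a made-up, in-memory link graph instead of the live Web, so the page names, the toy_web dictionary, and the crawl function are all illustrative assumptions and not part of any real crawler.

```python
from collections import deque

def crawl(root_urls, links, max_pages):
    """Breadth-first crawl over `links`, a dict mapping each URL to its outlinks."""
    visited = []                 # pages in the order they were crawled
    seen = set(root_urls)        # URLs already queued, to avoid repeats
    frontier = deque(root_urls)  # URLs waiting to be visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited

# A made-up four-page web.
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}
print(crawl(["a.example"], toy_web, max_pages=3))
# ['a.example', 'b.example', 'c.example']
```

The max_pages limit plays the role of the crawler's selection policy in its crudest form: the spider simply stops after a fixed number of pages.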


How often should pages be crawled? Since the Web is dynamic, last month’s crawled 
page may contain different content this month. Therefore, crawling is a never-ending pro- 
cess. Spiders return exhausted, carrying several new and many updated pages, only to be 
immediately given another root URL and told to start over. Theirs is an endless task like 
Sisyphus’s uphill ball-rolling. However, some pages change more often than others, so a 
crawler must decide which pages to revisit and how often. Some engines make this deci- 
sion democratically, while others refresh pages in proportion to their perceived freshness 
or importance levels. In fact, some researchers have proposed a crawling strategy that uses 
the PageRank measure of Chapters 3 and 4 to decide which pages to update [31]. 


How should pages be crawled ethically? When a spider visits a webpage, it consumes
resources, such as bandwidth and hits quotas, belonging to the page’s host and the
Internet at large. Like outdoor activists who try to “leave no trace,” polite spiders try to
minimize their impact. The Robots Exclusion Protocol was developed to define proper 
spidering activities and punish obnoxious, disrespectful spiders. In fact, website adminis- 
trators can use a robots.txt file to block spiders from accessing parts of their sites. 
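
A polite spider consults a site's robots.txt rules before requesting a page. The sketch below uses the RobotFileParser class from Python's standard library. The rules, the spider name ExampleSpider, and the example.com URLs are made up for illustration, and a real spider would download the file from the host instead of embedding it.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real spider would fetch it from the host.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("ExampleSpider", "http://www.example.com/index.html"))          # True
print(parser.can_fetch("ExampleSpider", "http://www.example.com/private/notes.html"))  # False
```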


How should multiple spiders coordinate their activities to avoid coverage overlap? 
One crawler can set several spiders loose on the Web, figuring parallel crawling can save 
time and effort. However, an optimal crawling policy is needed to ensure websites are not
visited multiple times, and thus significant overhead communication is required. 
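
One way to limit overlap without constant communication, offered here as an assumption and not as a method described in this chapter, is to assign every host to exactly one spider using a stable hash of the host name. Each spider can then decide on its own whether a newly found URL belongs to it. The function name assign_spider and the example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def assign_spider(url, num_spiders):
    """Map a URL to a spider index based on a stable hash of its host name."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_spiders

# Two pages on the same host are always handled by the same spider.
first = assign_spider("http://www.example.com/a.html", 4)
second = assign_spider("http://www.example.com/b.html", 4)
print(first == second)  # True
```

Partitioning by host instead of by full URL also keeps all requests to one site in the hands of a single spider, which makes it easier to stay polite to that site.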


Regardless of the ways a crawling program addresses these issues, spiders return 
with URLs for new or refreshed pages that need to be added to or updated in the search 
engine’s indexes. We discuss one index in particular, the content index, in the next section. 


Submitting a Site to Search Engines 


Like a castaway stranded on a tiny island, many webpage authors worry that a 
search engine spider might never find their webpage. This is certainly possible, 
especially if the page is about an obscure topic, and contains little content and 
few inlinks. Authors hosting a new page can check if spiders such as Googlebot 
have visited their site by viewing their web server’s log files. Most search engines 
have mechanisms to calm the fears of castaway authors. For example, Google 
offers authors a submission feature. Every webpage author can submit his or 
her site through a web form (http://www.google.com/addurl.html), which
adds the site to Google’s list of to-be-crawled URLs. While Google offers no 
guarantees on if or when the site will be crawled, this service does help both site 
authors and the Google crawler. Almost all major search engines offer a “Submit 
Your Site” feature, although some require small fees in exchange for a listing, 
featured listing, or sponsored listing in their index. 


Spidering Hacks 


Readers interested in programming their own special purpose crawler will find 
the O’Reilly book, Spidering Hacks [93], useful. This book contains 100 tips and 
tools for training a spider to do just about anything. With these tricks, your spider 
will be able to do more than just sit, roll over, and play dead; he’ll go find news 
stories about an actor, retrieve stock quotes, run an email discussion group, or find 
current topical trends on the Web.


Matlab Crawler m-file


With Cleve Moler’s permission, we display the guts of his Matlab spider here. 
If you’re a programmer or curious reader who’s not squeamish around spiders or 
Matlab code, please feel free to dissect. Squeamish, code-averse readers should 
skip ahead to section 2.2. 


Versions 6.5 and later of MATLAB contain two commands, urlread and url- 
write, that enable one to write simple m-files that crawl the Web. The m-file 
below, surfer.m, begins a web crawl at a root page and continues until n 
pages have been crawled. The program creates two outputs, U, a list of the 
n crawled URLs, and L, a sparse binary adjacency matrix containing the link 
structure of the n pages. (The L matrix is related to the H PageRank matrix 
of Chapter 4.) The command urlwrite can then be used to save the con- 
tents of each retrieved URL to a file, which can then be sent to the indexing 
module of the search engine for compression. (This m-file can be downloaded 
from the website for Cleve’s book Numerical Computing with Matlab [132], 
http://www.mathworks.com/moler/ncmfilelist.html.)


function [U,L] = surfer(root,n);
% SURFER Create the adjacency matrix of a portion of the Web.
%    [U,L] = surfer(root,n) starts at the URL root and follows
%    Web links until it forms an n-by-n adjacency matrix of links.
%    The output U is a cell array of the URLs visited and
%    L is a sparse matrix with L(i,j) = 1 if U{j} links to U{i}.
%    Example: [U,L] = surfer('http://www.ncsu.edu',500);
%    This function currently has two defects. (1) The algorithm for
%    finding links is naive. We just look for the string 'http:'.
%    (2) An attempt to read from a URL that is accessible, but very
%    slow, might take an unacceptably long time to complete. In
%    some cases, it may be necessary to have the operating system
%    terminate MATLAB. Key words from such URLs can be added to the
%    skip list in surfer.m.

% Initialize

U = cell(n,1);
hash = zeros(n,1);
L = logical(sparse(n,n));
m = 1;
U{m} = root;
hash(m) = hashfun(root);

for j = 1:n

   % Try to open a page.

   try
      disp(['open ' num2str(j) ' ' U{j}])
      page = urlread(U{j});
   catch
      disp(['fail ' num2str(j) ' ' U{j}])
      continue
   end

   % Follow the links from the open page.

   for f = findstr('http:',page);

      % A link starts with 'http:' and ends with next double quote.

      e = min(findstr('"',page(f:end)));
      if isempty(e), continue, end
      url = deblank(page(f:f+e-2));
      url(url<' ') = '!';   % Nonprintable characters
      if url(end) == '/', url(end) = []; end

      % Look for links that should be skipped.

      skips = {'.gif','.jpg','.pdf','.css','lmscadsi','cybernet',...
               'search.cgi','.ram','www.w3.org',...
               'scripts','netscape','shockwave','webex','fansonly'};
      skip = any(url=='!') | any(url=='?');
      k = 0;
      while ~skip & (k < length(skips))
         k = k+1;
         skip = ~isempty(findstr(url,skips{k}));
      end
      if skip
         if isempty(findstr(url,'.gif')) & isempty(findstr(url,'.jpg'))
            disp(['     skip ' url])
         end
         continue
      end

      % Check if page is already in url list.

      i = 0;
      for k = find(hash(1:m) == hashfun(url))';
         if isequal(U{k},url)
            i = k;
            break
         end
      end

      % Add a new url to the graph if there are fewer than n.

      if (i == 0) & (m < n)
         m = m+1;
         U{m} = url;
         hash(m) = hashfun(url);
         i = m;
      end

      % Add a new link.

      if i > 0
         disp(['     link ' int2str(i) ' ' url])
         L(i,j) = 1;
      end
   end
end

%------------------------

function h = hashfun(url)
% Almost unique numeric hash code for pages already visited.
h = length(url) + 1024*sum(url);
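
For readers more comfortable with Python than Matlab, the two core ideas of surfer.m, the naive link search and the almost unique hash code, can be sketched as follows. This is a simplified rendering and not a port: it leaves out the skip list and all network access, and the sample page string and the function name find_links are made up for illustration.

```python
def hashfun(url):
    """Length plus 1024 times the sum of the character codes, as in surfer.m."""
    return len(url) + 1024 * sum(ord(ch) for ch in url)

def find_links(page):
    """Naive link finder: each link runs from 'http:' to the next double quote."""
    links = []
    start = page.find("http:")
    while start != -1:
        end = page.find('"', start)
        if end == -1:
            break                      # no closing quote left in the page
        url = page[start:end].rstrip()
        if url.endswith("/"):
            url = url[:-1]             # drop one trailing slash
        links.append(url)
        start = page.find("http:", start + 1)
    return links

page = ('<a href="http://www.ncsu.edu/">NCSU</a> '
        '<a href="http://www.mathworks.com/moler">NCM</a>')
print(find_links(page))
# ['http://www.ncsu.edu', 'http://www.mathworks.com/moler']

# Two different URLs with the same characters in a different order collide.
print(hashfun("http://a.example/12") == hashfun("http://a.example/21"))  # True
```

The collision in the last line shows why the hash code is only almost unique, and why surfer.m confirms every hash match with isequal before treating a URL as already visited.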


2.2 THE CONTENT INDEX 


Each new or refreshed page that a spider brings back is sent to the indexing module, where 
software programs parse the page content and strip it of its valuable information, so that 
only the essential skeleton of the page is passed to the appropriate indexes. Valuable infor- 
mation is contained in title, description, and anchor text as well as in bolded terms, terms 
in large font, and hyperlinks. One important index is the content index, which stores the 
textual information for each page in compressed form. An inverted file, which is used to 
store this compressed information, is like the index in the back of a book. Next to each 
term is a list of all locations where the term appears. In the simplest case, the location is 
the page identifier. An inverted file might look like: 


• term 1 (aardvark) - 3, 117, 3961
⋮
• term 10 (aztec) - 3, 15, 19, 101, 673, 1199
• term 11 (baby) - 3, 31, 56, 94, 673, 909, 11114, 253791
⋮
• term m (zymurgy) - 1159223


This means that term 1 is used in webpages 3, 117, and 3961. It is clear that an advantage
of the inverted file is its use as a quick lookup table. Processing a query on term 11 begins 
by consulting the inverted list for term 11. 
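
A minimal sketch of how such an inverted file could be built is given below. It assumes pages are plain strings split on whitespace, which is far cruder than a real indexing module. The page texts and the function name build_inverted_file are invented for illustration.

```python
from collections import defaultdict

def build_inverted_file(pages):
    """Map each term to the sorted list of page identifiers that contain it."""
    inverted = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            inverted[term].add(page_id)
    return {term: sorted(ids) for term, ids in inverted.items()}

# Made-up pages; the identifiers echo the example above.
pages = {
    3: "aardvark aztec baby",
    117: "aardvark",
    673: "aztec baby",
}
inverted_file = build_inverted_file(pages)
print(inverted_file["aardvark"])  # [3, 117]
print(inverted_file["baby"])      # [3, 673]
```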


The simple inverted file, a staple in traditional information retrieval [147], does pose 
some challenges for web collections. Because multilingual terms, phrases, and proper 
names are used, the number of terms m, and thus the file size, is huge. Also, the number 
of webpages using popular broad terms such as weather or sports is large. Therefore, the 
number of page identifiers next to these terms is large and consumes storage. Further, 
page identifiers are usually not the only descriptors stored for each term (see section
2.3). Other descriptors such as the location of the term in the page (title, description, or
body) and the appearance of the term (bolded, large font, or in anchor text) are stored 
next to each page identifier. Any number of descriptors can be used to aid the search 
engine in retrieving relevant documents. In addition, as pages change content, so must 
their compressed representation in the inverted file. Thus, an active area of research is the 
design of methods for efficiently updating indexes. Lastly, the enormous inverted file must
be stored on a distributed architecture, which means strategies for optimal partitioning
must be designed. 


ASIDE: Indexing Wars 


While having a larger index of webpages accessed does not necessarily make one 
search engine better than another, it does mean the “bigger” search engine has a better op- 
portunity to return a longer list of relevant results, especially for unusual queries. As a result, 
search engines are constantly battling for the title of “The World’s Largest Index.” Reporters 
writing for The Search Engine Showdown or Search Engine Watch enjoy charting the chang- 
ing leaders in the indexing war. Figure 2.1 shows how self-reported search engine sizes have 
changed over the years.


Google, whose name is a play on googol, the word for the number 10^100, entered the search
market in 1998 and immediately grew, dethroning AltaVista and claiming the title of the
World’s Largest Index. In 2002, AlltheWeb snatched the title from Google by declaring it
had reached the two billion mark. Google soon regained the lead by indexing three billion
pages. AlltheWeb and Inktomi quickly upped their sizes to hit this same mark. The search
latecomer Teoma has been steadily growing its index since its debut in early 2002. Web
search engines use elaborate schemes, structures, and machines to store their massive
indices. In fact, in 2003, Google used a network of over 15,000 computers to store their
index [19], which in November 2004 jumped from 4.3 billion to 8.1 billion webpages. The
number of servers used today is at least an order of magnitude higher. Figure 2.2 shows part
of the server system that is housed in the Googleplex Mountain View, California site. Google
history buffs can see the dramatic evolution of Google’s server system by viewing pictures of
their original servers that used a Lego-constructed cabinet to house disk drives and cooling
fans (http://www-db.stanford.edu/pub/voy/museum/pictures/display/0-4-Google.htm).


Figure 2.2 Google servers (©Timothy Archibald, 2006)


The Internet Archive Project


In 1996 a nonprofit organization called the Internet Archive took on the ar- 
duous task of archiving the Web’s contents—pages, images, video files, audio 
files, etc. This project archives old versions of pages, pages that are now ex- 
tinct, as well as current pages. For example, to view the previous versions of 
author Carl Meyer’s homepage, use the Internet Archive’s Wayback Machine 
(http://web.archive.org/). Enter the address for Carl’s current homepage,
http://meyer.math.ncsu.edu/, and the Wayback machine returns archived 
versions and the dates of updates to this page. A temporary addition to the Archive 
website was a beta version of Anna Patterson’s Recall search engine. Because this 
engine was tailored to archival search, it had some novel features such as time- 
series plots of the relevancy of search terms over time. (Perhaps such features 
will become commonplace in mainstream engines, as Patterson now works for 
Google.) One of the archive’s goals is to make sure information on ephemeral 
pages is not lost forever because valuable trends and cultural artifacts exist in 
such pages. The archive also allows for systematic tracking of the Web’s evo- 
lution. Of course, as the Internet Archive Project continues to grow and receive 
support, it will inevitably claim the undisputed title of Index King, and hold the 
world’s largest document collection. 


2.3 QUERY PROCESSING 


Unlike the crawler and indexing modules of a search engine, the query module’s operations 
depend on the user. The query module must process user queries in real-time, and return re- 
sults in milliseconds. In February 2003, Google reported serving 250 million searches per 
day, while Overture and Inktomi handled 167 and 80 million, respectively [156]. Google 
likes to keep their processing time under half a second. In order to process a query this 
quickly, the query module accesses precomputed indexes such as the content index and the 
structure index. 


Consider an example that uses the inverted file below, which is copied from section 
2.2. 


• term 1 (aardvark) - 3, 117, 3961
⋮
• term 10 (aztec) - 3, 15, 19, 101, 673, 1199
• term 11 (baby) - 3, 31, 56, 94, 673, 909, 11114, 253791
⋮
• term m (zymurgy) - 1159223


Suppose a user enters the unusual query of aztec baby, and the search engine assumes 
the Boolean AND is used. Then the query module consults the inverted lists for aztec,
which is term 10, and baby, which is term 11. The resulting set of “on topic” or relevant
pages is {3, 673} because these pages use both query terms. Many traditional search 
engines stop here, returning this list to the user. However, for broad queries on the vast 
web collection, this set of relevant pages can be huge, containing hundreds of thousands of 
pages. Therefore, rankings are imposed on the pages in this set to make the list of retrieved 
pages more manageable. Consequently, the query module passes its list of relevant pages 
to the ranking module, which creates the list of pages ordered from most relevant to least 
relevant. The ranking module accesses precomputed indexes to create a ranking at query- 
time. In Chapter 1, we mentioned that search engines combine content scores for relevant 
pages with popularity scores to generate an overall score for each page. Relevant pages are 
then sorted by their overall scores. 
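
The Boolean AND step of this example amounts to intersecting inverted lists. The sketch below stores the lists from the example in a Python dictionary. The function name boolean_and is hypothetical, and Python sets are used only for brevity.

```python
inverted_file = {
    "aardvark": [3, 117, 3961],
    "aztec": [3, 15, 19, 101, 673, 1199],
    "baby": [3, 31, 56, 94, 673, 909, 11114, 253791],
    "zymurgy": [1159223],
}

def boolean_and(query, inverted_file):
    """Return the sorted page identifiers that contain every query term."""
    postings = [set(inverted_file.get(term, [])) for term in query.split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

print(boolean_and("aztec baby", inverted_file))  # [3, 673]
```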


We describe the creation of the content score with an example that also shows how 
the inverted file can be expanded to include more information. Suppose document 94 is 
updated by its author and now contains information about term 10 (aztec). This means 
that the inverted file must be updated, with the document identifier of 94 added to the 
list of pages recorded next to term 10. However, suppose that rather than storing just the 
document identifier, we decide to store three additional pieces of information. First, we 
record whether the term in question (aztec) appears in the title. Second, we record the 
term’s appearance (or not) in the description metatag. Finally, we record a count of the 
number of times the term appears in the page. One way to record this information is to 
append a vector to the document identifier for page 94 as follows: 


term 10 (aztec) - 3, 15, 19, 94 [1, 0, 7], 101, 673, 1199.


In the vector [1,0,7], the 1 means that term 10 appears in the title tag of page 94, the 0 
means that term 10 does not appear in the description tag of page 94, and the 7 means 
that term 10 occurred seven times in page 94. Similar information must be added to each 
element in the inverted file. That is, for every term, a three-dimensional vector must be 
inserted after each page identifier. While more work must be done by the indexing module 
and more storage used by the content index, the additional content information makes the 
search engine much better at processing queries. This is achieved by creating a content 
score for each page in the relevant set, which is now {3, 94, 673} in our example. At query 
time, the query module consults the inverted file, and for each document in the relevant 
set, pulls off the document identifier along with its appended three-dimensional vector. 
Suppose the result is: 


term 10 (aztec) — 3[1, 1, 27], 94 [1, 0, 7], 673 [0, 0, 3] 

term 11 (baby) — 3 [1, 1, 10], 94 [0, 0, 5], 673 [1, 1, 14] 
Heuristics or rules are now applied to determine a content score for documents 3, 94, and 
673. One elementary heuristic adds the values in the three-dimensional vector for term 
10/page 3 and multiplies this by the sum of the values in the vector for term 11/page 3. 
Thus, the content scores for the three relevant pages are: 

content score (page 3) = (1 + 1 + 27) × (1 + 1 + 10) = 348, 
content score (page 94) = (1 + 0 + 7) × (0 + 0 + 5) = 40, 
content score (page 673) = (0 + 0 + 3) × (1 + 1 + 14) = 48. 
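
The elementary heuristic above can be sketched in a few lines of Python. This is only an illustration of the sum-then-multiply rule applied to the example data; the dictionary layout and the function name `content_score` are our own assumptions, not the storage format of any real search engine.

```python
# Inverted-file entries from the example:
# term -> {page: [in_title, in_description, occurrence_count]}
inverted_file = {
    "aztec": {3: [1, 1, 27], 94: [1, 0, 7], 673: [0, 0, 3]},
    "baby": {3: [1, 1, 10], 94: [0, 0, 5], 673: [1, 1, 14]},
}


def content_score(page, query_terms, index):
    """Multiply, over the query terms, the sum of each term's vector for the page."""
    score = 1
    for term in query_terms:
        score *= sum(index[term][page])
    return score


query = ["aztec", "baby"]
relevant_pages = [3, 94, 673]
scores = {page: content_score(page, query, inverted_file) for page in relevant_pages}
print(scores)  # {3: 348, 94: 40, 673: 48}
```

Sorting the relevant pages by these scores would list page 3 first, then page 673, then page 94.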


Different schemes exist with many other factors making up the content score [30]. 


The content score can be computed solely from the content index and its inverted 
file, and is query-dependent. On the other hand, the popularity score is computed solely 
from the structure index, and is usually query-independent. The remainder of this book 
is devoted to the popularity score, so we postpone its description and computation until 
later. For now, we merely state that each page on the Web has a popularity score, which 
is independent of user queries and which gives a global measure of that page’s popularity 
within the search engine’s entire index of pages. This popularity score is then combined 
with the content score, for example, by multiplication, to create an overall score for each 
relevant page for a given query. 


Lord Campbell’s Motion to Index 


John Campbell (1799-1861) was a Scottish lawyer and politician who became 
Lord Chancellor of Great Britain in 1859. In the preface to volume 3 of his book, 
Lives of the Chief Justices [45], Lord Campbell writes: 


So essential do I consider an Index to be to every book, that I proposed to bring 
a Bill into Parliament to deprive an author who publishes a book without an 
Index of the privilege of copyright; and, moreover, to subject him, for his 
offence, to a pecuniary penalty. 


Unfortunately, his bill was never enacted, perhaps because Parliamentary mem- 
bers and their constituents wanted to shirk the responsibility and effort associated 
with creating a good index. 


Appeals similar to Lord Campbell’s have been made by the web community and 
its indexers. W3C, the World Wide Web Consortium, has been pushing for a more 
rigorous structure for HTML documents (e.g., XML documents and RSS code) 
that will allow the indexers of search engines to more accurately and quickly pull 
the essential elements from documents. On the other hand, the Web’s lack of 
structure is recognized universally as a source of its strength and a major contrib- 
utor to its many creative uses. In an attempt to outline a balance between structure 
and freedom, in July 1997, former President Bill Clinton wrote the “Framework 
for Global Electronic Commerce,” which advocated a laissez-faire attitude toward 
web legislation and regulation. 


Chapter Three 
Ranking Webpages by Popularity 


Nobody wants to be picked last for teams in gym class. Likewise, nobody wants their 
webpage to appear last in the list of relevant pages for a search query. As a result, many 
grown-ups transfer their high school wishes to be the “Most Popular” to their webpages. 
The remainder of this book is about the popularity contests that search engines hold for 
webpages. Specifically, it’s about the popularity score, which is combined with the tra- 
ditional content score of section 2.3 to rank retrieved pages by relevance. By 1998, the 
traditional content score was buckling under the Web’s massive size and the death grip 
of spammers. In 1998, the popularity score came to the rescue of the content score. The 
popularity score became a crucial complement to the content score and provided search en- 
gines with impressively accurate results for all types of queries. The popularity score, also 
known as the importance score, harnesses the information in the immense graph created by 
the Web’s hyperlink structure. Thus, models exploiting the Web’s hyperlink structure are 
called link analysis models. The impact that these link analysis models have had is truly 
awesome. Since 1998, the use of web search engines has increased dramatically. In fact, 
an April 2004 survey by Websense, Inc., reported that half of the respondents would rather 
forfeit their habitual morning cup of coffee than their connectivity. That’s because today’s 
search tools allow users to answer in seconds queries that were impossible just a decade 
ago (from fun searches for pictures, quotes, and snooping amateur detective work to more 
serious searches for academic research papers and patented inventions). In this chapter, we 
introduce the intuition behind the classic link analysis systems of PageRank [40] and HITS 
[106]. 


3.1 THE SCENE IN 1998 


The year 1998 was a busy year for link analysis models. At IBM Almaden in Silicon Valley, 
a young scientist named Jon Kleinberg, now a professor at Cornell 
University, was working on a Web search engine project called HITS, 
an acronym for Hypertext Induced Topic Search. His algorithm used 
the hyperlink structure of the Web to improve search engine results, 
an innovative idea at the time, as most search engines used only tex- 
tual content to return relevant documents. He presented his work 
[106], begun a year earlier at IBM, in January 1998 at the Ninth An- 
nual ACM-SIAM Symposium on Discrete Algorithms held in San 
Francisco, California. 


Very nearby, at Stanford University, two computer science doctoral candidates were 
working late nights on a similar project called PageRank. Sergey Brin and Larry Page had 
been collaborating on their Web search engine since 1995. By 1998, things were really 
starting to accelerate for these two scientists. They were using their dorm rooms as offices 
for their fledgling business, which later became the giant Google. By August 1998, both 
Brin and Page took a leave of absence from Stanford in order to focus on 
their growing business. In a public presentation at the Seventh International World Wide 
Web conference (WWW98) in Brisbane, Australia, their 
paper “The anatomy of a large-scale hypertextual Web 
search engine” [39] made small ripples in the informa- 
tion science community that quickly turned into waves. 
It appears that HITS and PageRank were developed in- 
dependently despite the close geographic and temporal 
proximity of the discoveries. The connections between 
the two models are striking (see [110]). Nevertheless, since that eventful year, PageRank 
has emerged as the dominant link analysis model, partly due to its query-independence 
(see section 3.3), its virtual immunity to spamming, and Google’s huge business success. 
Kleinberg was already making a name for himself as an innovative academic, and unlike 
Brin and Page did not try to develop HITS into a company. However, later entrepreneurs 
did, thus giving HITS its belated and deserving claim to commercial success. The search 
engine Teoma uses an extension of the HITS algorithm as the basis of its underlying tech- 
nology [150]. Incidentally, Google has kept Brin and Page famously busy and wealthy 
enough to remain on leave from Stanford, as well as make their debut break into People’s 
June 28th List of the 50 Hottest Bachelors of 2004. 


3.2 TWO THESES 


In this section, we describe the theses underlying both PageRank and HITS. In order to do 
that, we need to define the Web as a graph. The Web’s hyperlink structure forms a massive 
directed graph. The nodes in the graph represent webpages and the directed arcs or links 
represent the hyperlinks. Thus, hyperlinks into a page, which are called inlinks, point into 
nodes, while outlinks point out from nodes. Figure 3.1 shows a tiny, artificial document 
collection consisting of six webpages. 


Figure 3.1 Directed graph representing web of six pages (labels in the figure mark a node, an inlink to node 2, and an outlink from node 1) 


Maps of the Web Graph 


The massive web graph has little resemblance to the clean tiny graph of Figure 
3.1. Instead, the Web’s nodes and arcs create a jumbled mess that’s a headache 
to untangle and present in a visually appealing and meaningful way. Fortunately, 
many researchers have succeeded. The Atlas of Cyberspace [62] presents over 
300 colorful, informative maps of cyberactivities. With permission, we present 
a hyperlink graph that was the graduate work of Stanford’s Tamara Munzner. 
She used three-dimensional hyperbolic spaces to produce the map on the left side 
of Figure 3.2. Munzner’s ideas were implemented by Young Hyun in his Java 
software program called Walrus (even though the pictures it draws look like jelly- 
fish.) The right side of Figure 3.2 is one of Hyun’s maps with 535,102 nodes and 
601,678 links. 


Figure 3.2 Munzner’s and Hyun’s maps of subsets of the Web 


3.2.1 PageRank 


Before 1998, the web graph was largely an untapped source of information. While re- 
searchers like Kleinberg and Brin and Page recognized this graph’s potential, most people 
wondered just what the web graph had to do with search engine results. The connection is 
understood by viewing a hyperlink as a recommendation. A hyperlink from my homepage 
to your page is my endorsement of your page. Thus, a page with more recommendations 
(which are realized through inlinks) must be more important than a page with a few in- 
links. However, similar to other recommendation systems such as bibliographic citations 
or letters of reference, the status of the recommender is also important. For example, one 
personal endorsement from Donald Trump probably does more to strengthen a job applica- 
tion than 20 endorsements from 20 unknown teachers and colleagues. On the other hand, if 
the job interviewer learns that Donald Trump is very free and generous with his praises of 
employees, and he (or his secretary) has written over 40,000 recommendations in his life, 
then his recommendation suddenly drops in weight. Thus, weights signifying the status of 
a recommender must be lowered for recommenders with little discrimination. In fact, the 
weight of each endorsement should be tempered by the total number of recommendations 
made by the recommender. 


Actually, this is exactly how Google’s PageRank popularity score works. This 
PageRank score is very famous, even notoriously so (see the asides on pages 52 and 112). 
Literally hundreds of papers have been written about it, and this book is one of the first of 
the undoubtedly many that is devoted to PageRank’s methodology, mechanism, and com- 
putation. In the coming chapters we will reveal many reasons why PageRank has become 
so popular, but one of the most convincing reasons for studying the PageRank score is 
Google’s own admission of its impact on their successful technology. According to the 
Google website (http://www.google.com/technology/index.html) “the heart of 
[Google’s] software is PageRank . . . [which] continues to provide the basis for all of [our] 
web search tools.” 


In short, PageRank’s thesis is that a webpage is important if it is pointed to by other 
important pages. Sounds circular, doesn’t it? We will see in Chapter 4 that this can be 
formalized in a beautifully simple mathematical formula. 


Google Toolbar 


Comparing the PageRank scores for two pages gives an indication of the relative 
importance of the two pages. However, Google guards the exact PageRank scores 
for the pages in its index very carefully, and for good reason. (See the aside on 
page 52.) Google does graciously provide public access to a very rough approxi- 
mation of their PageRank scores. These approximations are available through the 
Google toolbar, which can be downloaded at http://toolbar.google.com/. 
The toolbar, which then resides on the browser, displays a lone bar graph show- 
ing the approximate PageRank for the current page. The displayed PageRank 
is an integer from 0 to 10, with the most important pages receiving a PageRank 
of 10. The toolbar automatically updates this display for each page you visit. 
Thus, it must send information about the page you’re viewing to the Google 
servers. Google’s privacy policy states that it does not collect information that 
directly identifies you (e.g., your name or email address) and will not sell any 
information. For those still concerned with their privacy, Google allows users 
to disable the PageRank feature while still maintaining the functionality of the 
other toolbar features. There’s a way to access the approximate PageRank 
scores without getting the Toolbar: visit 
http://www.seochat.com/seo-tools/PageRank-search/, enter a query, and view the PageRank bar graphs 
next to the results. Readers can get a feel for PageRank by locating high 
PageRank pages (e.g., www.espn.com with a 9/10 score) and low PageRank 
pages (http://www.csc.ncsu.edu:8080/nsmc2003/ with a 0/10 score). 
Google’s homepage (www.google.com) has a PageRank of 10, perhaps automatically 
set. Google sets the PageRank of pages identified to be authored by 
spammers to 0 [160], a value known among spammers as the horrifying PR0. 


3.2.2 HITS 


Kleinberg’s HITS method for ranking pages is very similar to PageRank, but it uses both 
inlinks and outlinks to create two popularity scores for each page. HITS defines hubs and 
authorities. A page is considered a hub if, similar in some respects to an airline hub, it contains 
many outlinks. With an equally descriptive term, a page is called an authority if it has 
many inlinks. Of course, a page can be both a hub and an authority, just as the Hartsfield-Jackson 
Atlanta airport has many incoming and outgoing flights each hour. Thus, HITS 
assigns both a hub score and an authority score to each page. Very similar to PageRank’s 
lone circular thesis, HITS has a pair of interdependent circular theses: a page is a good 
hub (and therefore deserves a high hub score) if it points to good authorities, and a page is 
a good authority if it is pointed to by good hubs. Like PageRank, these circular definitions 
create simple mathematical formulas. Readers will have to wait until Chapter 11 to hear 
the details of HITS. Although developed during 1997-1998, HITS was not incorporated 
into a commercial search engine until 2001 when the search newcomer Teoma adopted it 
as the heart of its underlying technology [150]. Check out Teoma at www.teoma.com, 
and notice that the pages listed as “Results” correspond to HITS authorities and the pages 
listed under “Resources” correspond to HITS hubs. 


Figure 3.3 A hub node and an authority node 
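
To give a feel for how the two circular theses can be turned into a computation, here is a minimal Python sketch of the mutual reinforcement idea. It is not the formulation of Chapter 11; the six-page link structure, the number of iterations, and the function names `hits` and `normalize` are assumptions made purely for illustration.

```python
import math

# Outlinks of a small hypothetical six-page web (an assumption for illustration).
outlinks = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}


def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {page: v / length for page, v in scores.items()}


def hits(outlinks, iterations=50):
    hubs = {page: 1.0 for page in outlinks}
    authorities = {page: 1.0 for page in outlinks}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        authorities = {page: 0.0 for page in outlinks}
        for page, targets in outlinks.items():
            for target in targets:
                authorities[target] += hubs[page]
        authorities = normalize(authorities)
        # A good hub points to good authorities.
        hubs = normalize(
            {page: sum(authorities[t] for t in targets) for page, targets in outlinks.items()}
        )
    return hubs, authorities


hubs, authorities = hits(outlinks)
for page in sorted(outlinks):
    print(page, round(hubs[page], 3), round(authorities[page], 3))
```

In this sketch, page 2 has no outlinks, so its hub score is 0 even though it can still earn an authority score from the pages that point to it.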


Inlink Feature 


Of course, every webpage author knows exactly how many outlinks his or 
her page has and to which pages these outlinks point. However, getting a 
hold on inlink counts and inlinking pages is not as obvious. Fortunately, with 
the help of third party services, it is equally easy to uncover this inlink in- 
formation. For example, Google’s link: feature can be used to see how 
many and which, if any, important pages point to yours. Try typing 
link:http://www4.ncsu.edu:8030/~anlangvi into Google’s input box and notice 
the modest number of inlinks to Amy’s homepage. (If you like this book 
and our research, you can help both of us improve our popularity scores with rec- 
ommendations through hyperlinks. We prefer inlinks from authors of important 
pages. Of course, we joke in this parenthetical comment but our comments fore- 
shadow some of the exciting and serious side effects associated with link analysis. 
See the asides on search engine optimization and link farms on pages 43 and 52, 
respectively.) To find out how many inlinks your page has in the indexes of other 
search engines, go to http://www.top25web.com/cgi-bin/linkpop.cgi. 
This website also provides other tools, such as a ranking report and PageRank 
score that reports the Toolbar scores for several pages at once. 


3.3 QUERY-INDEPENDENCE 


It is now time to emphasize the word query-independence. A ranking is called query- 
independent if the popularity score for each page is determined off-line, and remains con- 
stant (until the next update) regardless of the query. This means at query time, when mil- 
liseconds are precious, no time is spent computing the popularity scores for relevant pages. 
The scores are merely “looked up” in the previously computed popularity table. This can 
be contrasted with the traditional information retrieval scores of section 2.3, which are 
query-dependent. We will see that popularity scoring systems can be classified as either 
query-independent or query-dependent. This classification is important because it imme- 
diately reveals a system’s advantages and disadvantages. PageRank is query-independent, 
which means it produces a global ranking of the importance of all pages in Google’s index 
of 8.1 billion pages. On the other hand, HITS in its original version is query-dependent. 
Both PageRank and HITS can be modified to change their classifications. See Chapter 11. 


Who Links to Whom 


The science of who links to whom has extended beyond the Web to a variety 
of other networks that collectively go by the name of complex systems. Graph 
techniques have successfully been applied to learn valuable information about 
networks ranging from the AIDS transmission and power grid networks to terror- 
ist and email networks. The recent book by Barabasi, Linked: The New Science 
of Networks [16], contains a fascinating and entertaining introduction to these 
complex systems. 


Chapter Four 


The Mathematics of Google’s PageRank 


The famous and colorful mathematician Paul Erdős (1913-96) talked about The Great 
Book, a make-believe book in which he imagined God kept the world’s most elegant and 
beautiful proofs. In 2002, Graham Farmelo of London’s Science Museum edited and con- 
tributed to a similar book, a book of beautiful equations. It Must Be Beautiful: Great 
Equations of Modern Science [73] is a collection of 11 essays about the greatest scientific 
equations, equations like $E = hf$ and $E = mc^2$. The contributing authors were invited 
to give their answers to the tough question of what makes an equation great. One author, 
Frank Wilczek, included a quote by Heinrich Hertz regarding Maxwell’s equation: 


One cannot escape the feeling that these mathematical formulae have an inde- 
pendent existence and an intelligence of their own, that they are wiser than we 
are, wiser even than their discoverers, that we get more out of them than was 
originally put into them. 


While we are not suggesting that the PageRank equation presented in this chapter, 

$$\pi^T = \pi^T(\alpha S + (1-\alpha)E),$$


deserves a place in Farmelo’s book alongside Einstein’s theory of relativity, we do find 
Hertz’s statement apropos. One can get a lot of mileage from the simple PageRank formula 
above—Google certainly has. Since beauty is in the eye of the beholder, we’ll let you 
decide whether or not the PageRank formula deserves the adjective beautiful. We hope the 
next few chapters will convince you that it just might. 


In Chapter 3, we used words to present the PageRank thesis: a page is important 
if it is pointed to by other important pages. It is now time to translate these words into 
mathematical equations. This translation reveals that the PageRank importance scores are 
actually the stationary values of an enormous Markov chain, and consequently Markov 
theory explains many interesting properties of the elegantly simple PageRank model used 
by Google. 


This is the first of the mathematical chapters. Many of the mathematical terms in 
each chapter are explained in the Mathematics Chapter (Chapter 15). As terms that appear 
in the Mathematics Chapter are introduced in the text, they are italicized to remind you 
that definitions and more information can be found in Chapter 15. 




4.1 THE ORIGINAL SUMMATION FORMULA FOR PAGERANK 


Brin and Page, the inventors of PageRank,¹ began with a simple summation equation, 
the roots of which actually derive from bibliometrics research, the analysis of the citation 
structure among academic papers. The PageRank of a page $P_i$, denoted $r(P_i)$, is the sum 
of the PageRanks of all pages pointing into $P_i$. 

$$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|}, \tag{4.1.1}$$

where $B_{P_i}$ is the set of pages pointing into $P_i$ (backlinking to $P_i$ in Brin and Page’s words) 
and $|P_j|$ is the number of outlinks from page $P_j$. Notice that the PageRank of inlinking 
pages $r(P_j)$ in equation (4.1.1) is tempered by the number of recommendations made 
by $P_j$, denoted $|P_j|$. The problem with equation (4.1.1) is that the $r(P_j)$ values, the 
PageRanks of pages inlinking to page $P_i$, are unknown. To sidestep this problem, Brin 
and Page used an iterative procedure. That is, they assumed that, in the beginning, all 
pages have equal PageRank (of say, $1/n$, where $n$ is the number of pages in Google’s index 
of the Web). Now the rule in equation (4.1.1) is followed to compute $r(P_i)$ for each 
page $P_i$ in the index. The rule in equation (4.1.1) is successively applied, substituting the 
values of the previous iterate into $r(P_j)$. We introduce some more notation in order to 
define this iterative procedure. Let $r_{k+1}(P_i)$ be the PageRank of page $P_i$ at iteration $k+1$. 
Then, 

$$r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|P_j|}. \tag{4.1.2}$$

This process is initiated with $r_0(P_i) = 1/n$ for all pages $P_i$ and repeated with the hope 
that the PageRank scores will eventually converge to some final stable values. Applying 
equation (4.1.2) to the tiny web of Figure 4.1 gives the following values for the PageRanks 
after a few iterations. 


Figure 4.1 Directed graph representing web of six pages 


¹The patent for PageRank was filed in 1998 by Larry Page and granted in 2001 (US Patent #6285999), and 
thus the name PageRank has a double reference to both webpages and one of its founding fathers. 


Table 4.1 First few iterates using (4.1.2) on Figure 4.1 


Iteration 0        Iteration 1         Iteration 2          Rank at Iter. 2 
r_0(P_1) = 1/6     r_1(P_1) = 1/18     r_2(P_1) = 1/36      5 
r_0(P_2) = 1/6     r_1(P_2) = 5/36     r_2(P_2) = 1/18      4 
r_0(P_3) = 1/6     r_1(P_3) = 1/12     r_2(P_3) = 1/36      5 
r_0(P_4) = 1/6     r_1(P_4) = 1/4      r_2(P_4) = 17/72     1 
r_0(P_5) = 1/6     r_1(P_5) = 5/36     r_2(P_5) = 11/72     3 
r_0(P_6) = 1/6     r_1(P_6) = 1/6      r_2(P_6) = 14/72     2 
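
Readers who want to check these iterates can do so with a short Python sketch of equation (4.1.2). The link structure below is read off the nonzero entries of the hyperlink matrix H given in section 4.2; the variable and function names are our own, and exact fractions are used so the output can be compared with the table directly.

```python
from fractions import Fraction

# Outlinks of the six-page web, read off the nonzero pattern of H in section 4.2.
outlinks = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}
pages = sorted(outlinks)

# B_{P_i}: the set of pages pointing into page i.
backlinks = {i: [j for j in pages if i in outlinks[j]] for i in pages}


def iterate(r):
    """One application of equation (4.1.2) to every page."""
    return {
        i: sum((r[j] / len(outlinks[j]) for j in backlinks[i]), Fraction(0))
        for i in pages
    }


r0 = {i: Fraction(1, len(pages)) for i in pages}  # equal starting PageRank 1/n
r1 = iterate(r0)
r2 = iterate(r1)
for i in pages:
    print(f"P{i}: {r0[i]}  {r1[i]}  {r2[i]}")
```

Python prints fractions in lowest terms, so the entry 14/72 for page 6 appears as 7/36.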


4.2 MATRIX REPRESENTATION OF THE SUMMATION EQUATIONS 


Equations (4.1.1) and (4.1.2) compute PageRank one page at a time. Using matrices, we 
replace the tedious $\sum$ symbol, and at each iteration, compute a PageRank vector, which 
uses a single $1 \times n$ vector to hold the PageRank values for all pages in the index. In order to 
do this, we introduce an $n \times n$ matrix $H$ and a $1 \times n$ row vector $\pi^T$. The matrix H is a row 
normalized hyperlink matrix with $H_{ij} = 1/|P_i|$ if there is a link from node $i$ to node $j$, and 
0, otherwise. Although H has the same nonzero structure as the binary adjacency matrix 
for the graph (called L in the Matlab Crawler m-file on page 17), its nonzero elements are 
probabilities. Consider once again the tiny web graph of Figure 4.1. 


The H matrix for this graph is 


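$$
H =
\begin{array}{c|cccccc}
    & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 \\ \hline
P_1 & 0   & 1/2 & 1/2 & 0   & 0   & 0   \\
P_2 & 0   & 0   & 0   & 0   & 0   & 0   \\
P_3 & 1/3 & 1/3 & 0   & 0   & 1/3 & 0   \\
P_4 & 0   & 0   & 0   & 0   & 1/2 & 1/2 \\
P_5 & 0   & 0   & 0   & 1/2 & 0   & 1/2 \\
P_6 & 0   & 0   & 0   & 1   & 0   & 0
\end{array}
$$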


The nonzero elements of row $i$ correspond to the outlinking pages of page $i$, whereas 
the nonzero elements of column $i$ correspond to the inlinking pages of page $i$. We now 
introduce a row vector $\pi^{(k)T}$, which is the PageRank vector at the $k^{th}$ iteration. Using this 
matrix notation, equation (4.1.2) can be written compactly as 

$$\pi^{(k+1)T} = \pi^{(k)T} H. \tag{4.2.1}$$

If you like, verify with the example of Figure 4.1 that the iterates of equation (4.2.1) match 
those of equation (4.1.2). 
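
One way to carry out this verification is sketched below in plain Python with exact fractions. The helper `vec_mat` is our own name for a row-vector-times-matrix product; it is written out by hand here only to keep the example free of external libraries.

```python
from fractions import Fraction as F

# Row-normalized hyperlink matrix H for the six-page example.
H = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]


def vec_mat(pi, M):
    """Row vector times matrix: (pi M)_j = sum_i pi_i * M_ij."""
    n = len(M)
    return [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]


pi0 = [F(1, 6)] * 6     # starting vector with equal PageRank 1/n
pi1 = vec_mat(pi0, H)   # first iterate of equation (4.2.1)
pi2 = vec_mat(pi1, H)   # second iterate
print([str(x) for x in pi1])
print([str(x) for x in pi2])
```

The two printed vectors hold the same values as the Iteration 1 and Iteration 2 columns produced by the page-at-a-time formula (4.1.2).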


Matrix equation (4.2.1) yields some immediate observations. 


1. Each iteration of equation (4.2.1) involves one vector-matrix multiplication, which 
generally requires $O(n^2)$ computation, where $n$ is the size of the square matrix H. 


2. H is a very sparse matrix (a large proportion of its elements are 0) because most 
webpages link to only a handful of other pages. Sparse matrices, such as the one 
shown in Figure 4.2, are welcome for several reasons. First, they require minimal 
storage, since sparse storage schemes, which store only the nonzero elements of the 
matrix and their location [145], exist. Second, vector-matrix multiplication involving 
a sparse matrix requires much less effort than the $O(n^2)$ dense computation. In fact, 
it requires $O(nnz(H))$ computation, where $nnz(H)$ is the number of nonzeros in 
H. Estimates show that the average webpage has about 10 outlinks, which means 
that H has about $10n$ nonzeros, as opposed to the $n^2$ nonzeros in a completely dense 
matrix. This means that the vector-matrix multiplication of equation (4.2.1) reduces 
to $O(n)$ effort. 


Figure 4.2 Example of a sparse matrix. The nonzero elements are indicated by pixels. 


3. The iterative process of equation (4.2.1) is a simple linear stationary process of the 
form studied in most numerical analysis classes [82, 127]. In fact, it is the classical 
power method applied to H. 


4. H looks a lot like a stochastic transition probability matrix for a Markov chain. 
The dangling nodes of the network, those nodes with no outlinks, create 0 rows in 
the matrix. All the other rows, which correspond to the nondangling nodes, create 
stochastic rows. Thus, H is called substochastic. 
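
Observation 2 can be made concrete with a small Python sketch. The storage layout below (a dictionary of rows, each holding only its nonzero column-value pairs) is just one simple sparse scheme chosen for illustration, not the scheme of [145] or of any particular search engine, and pages are numbered from 0 for convenience.

```python
# Sparse storage of H for the six-page example: for each nonzero row i,
# only the (column, value) pairs of its nonzero entries are kept.
sparse_H = {
    0: [(1, 1 / 2), (2, 1 / 2)],
    2: [(0, 1 / 3), (1, 1 / 3), (4, 1 / 3)],
    3: [(4, 1 / 2), (5, 1 / 2)],
    4: [(3, 1 / 2), (5, 1 / 2)],
    5: [(3, 1.0)],
}


def sparse_vec_mat(pi, rows):
    """Compute pi H while touching each stored nonzero exactly once."""
    result = [0.0] * len(pi)
    operations = 0
    for i, entries in rows.items():
        for j, value in entries:
            result[j] += pi[i] * value
            operations += 1
    return result, operations


pi0 = [1 / 6] * 6
pi1, operations = sparse_vec_mat(pi0, sparse_H)
print(pi1)
print(operations)  # 10 multiplications, versus 36 for the dense 6 x 6 product
```

The count of multiplications equals the number of nonzeros in H, which is the $O(nnz(H))$ behavior described in observation 2.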


These four observations are important to the development and execution of the 
PageRank model, and we will return to them throughout the chapter. For now, we spend 
more time examining the iterative matrix equation (4.2.1). 


4.3 PROBLEMS WITH THE ITERATIVE PROCESS 


Equation (4.2.1) probably caused readers, especially our mathematical readers, to ask sev- 
eral questions. For example, 


• Will this iterative process continue indefinitely or will it converge? 

• Under what circumstances or properties of H is it guaranteed to converge? 

• Will it converge to something that makes sense in the context of the PageRank problem? 

• Will it converge to just one vector or multiple vectors? 

• Does the convergence depend on the starting vector $\pi^{(0)T}$? 

• If it will converge eventually, how long is “eventually”? That is, how many iterations 
can we expect until convergence? 


We’ll answer these questions in the next few sections. However, our answers depend on 
how Brin and Page chose to resolve some of the problems they encountered with their 
equation (4.2.1). 


Brin and Page originally started the iterative process with $\pi^{(0)T} = (1/n)e^T$, where 
$e^T$ is the row vector of all 1s. They immediately ran into several problems when using 
equation (4.2.1) with this initial vector. For example, there is the problem of rank sinks, 
those pages that accumulate more and more PageRank at each iteration, monopolizing the 
scores and refusing to share. In the simple example of Figure 4.3, the dangling node 3 
is a rank sink. In the more complicated example of Figure 4.1, the cluster of nodes 4, 5, 
and 6 conspire to hoard PageRank. After just 13 iterations of equation (4.2.1), 
$\pi^{(13)T} \approx (0 \;\; 0 \;\; 0 \;\; 4/15 \;\; 2/15 \;\; 1/5)$. This conspiring can be malicious or inadvertent. (See the 
asides on search engine optimization and link farms on pages 43 and 52, respectively.) 
The example with $\pi^{(13)T}$ also shows another problem caused by sinks. As nodes hoard 
PageRank, some nodes may be left with none. Thus, ranking nodes by their PageRank 
values is tough when a majority of the nodes are tied with PageRank 0. Ideally, we’d 
prefer the PageRank vector to be positive, i.e., contain all positive values. 


Figure 4.3 Simple graph with rank sink 
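
The hoarding behavior is easy to observe numerically. The sketch below, written in plain Python with floating-point arithmetic, applies equation (4.2.1) thirteen times to the matrix H of section 4.2, starting from the uniform vector; the variable names are our own.

```python
# Hyperlink matrix H for the six-page example (row 2 is the dangling node).
H = [
    [0, 1 / 2, 1 / 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1 / 3, 1 / 3, 0, 0, 1 / 3, 0],
    [0, 0, 0, 0, 1 / 2, 1 / 2],
    [0, 0, 0, 1 / 2, 0, 1 / 2],
    [0, 0, 0, 1, 0, 0],
]

pi = [1 / 6] * 6
for _ in range(13):
    # One step of equation (4.2.1): new pi = pi H.
    pi = [sum(pi[i] * H[i][j] for i in range(6)) for j in range(6)]

print([round(x, 4) for x in pi])
print("total PageRank remaining:", round(sum(pi), 4))
```

Essentially all of the remaining PageRank sits on nodes 4, 5, and 6, while nodes 1, 2, and 3 are left with values that are practically 0. The total is also less than 1, because the zero row of the dangling node 2 passes none of its PageRank along.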


There’s also the problem of cycles. Consider the simplest case in Figure 4.4. Page 
1 only points to page 2 and vice versa, creating an infinite loop or cycle. Suppose the 
iterative process of equation (4.2.1) is run with $\pi^{(0)T} = (1 \;\; 0)$. The iterates will not 
converge no matter how long the process is run. The iterates $\pi^{(k)T}$ flip-flop indefinitely 
between $(1 \;\; 0)$ when $k$ is even and $(0 \;\; 1)$ when $k$ is odd. 


Figure 4.4 Simple graph with cycle 
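
The flip-flopping can be seen directly in a few lines of Python. This sketch assumes the two-page web just described, whose hyperlink matrix has rows (0 1) and (1 0), and records the first few iterates.

```python
# Two-page web: page 1 links only to page 2, and page 2 links only to page 1.
H = [[0, 1], [1, 0]]

pi = [1, 0]  # starting vector
iterates = [pi]
for _ in range(5):
    # One step of equation (4.2.1): new pi = pi H.
    pi = [sum(pi[i] * H[i][j] for i in range(2)) for j in range(2)]
    iterates.append(pi)

print(iterates)  # [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
```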


4.4 A LITTLE MARKOV CHAIN THEORY 


Before we get to Brin and Page’s adjustments to equation (4.2.1), which solve the problems 
of the previous section, we pause to introduce a little theory for Markov chains. (We urge 
readers who are less familiar with Markov chains to read the Mathematics Chapter, Chapter 
15, before proceeding.) In observations 3 and 4, we noted that equation (4.2.1) resembled 
the power method applied to a Markov chain with transition probability matrix H. These 
observations are very helpful because the theory of Markov chains is well developed,² and 
very applicable to the PageRank problem. With Markov theory we can make adjustments 
to equation (4.2.1) that ensure desirable results, convergence properties, and encouraging 
answers to the questions on page 34. In particular, we know that, for any starting vector, the 
power method applied to a Markov matrix P converges to a unique positive vector called 
the stationary vector as long as P is stochastic, irreducible, and aperiodic. (Aperiodicity 
plus irreducibility implies primitivity.) Therefore, the PageRank convergence problems 
caused by sinks and cycles can be overcome if H is modified slightly so that it is a Markov 
matrix with these desired properties. 


Markov properties affecting PageRank 


A unique positive PageRank vector exists when the Google matrix is stochastic 
and irreducible. Further, with the additional property of aperiodicity, the power 
method will converge to this PageRank vector, regardless of the starting vector 
for the iterative process. 


4.5 EARLY ADJUSTMENTS TO THE BASIC MODEL 


In fact, this is exactly what Brin and Page did. They describe their adjustments to the basic 
PageRank model in their original 1998 papers. It is interesting to note that none of their 
papers used the phrase “Markov chain,” not even once. Although, most surely, if they were 
unaware of it in 1998, they now know the connection their original model has to Markov 
chains, as Markov chain researchers have excitedly and steadily jumped on the PageRank 
bandwagon, eager to work on what some call the grand application of Markov chains. 


Rather than using Markov chains and their properties to describe their adjustments, 
Brin and Page use the notion of a random surfer. Imagine a web surfer who bounces 
along randomly following the hyperlink structure of the Web. That is, when he arrives at 
a page with several outlinks, he chooses one at random, hyperlinks to this new page, and 
continues this random decision process indefinitely. In the long run, the proportion of time 
the random surfer spends on a given page is a measure of the relative importance of that 
page. If he spends a large proportion of his time on a particular page, then he must have, in 
randomly following the hyperlink structure of the Web, repeatedly found himself returning 
to that page. Pages that he revisits often must be important, because they must be pointed 
to by other important pages. Unfortunately, this random surfer encounters some problems. 
He gets caught whenever he enters a dangling node. And on the Web there are plenty of 
nodes dangling, e.g., pdf files, image files, data tables, etc. To fix this, Brin and Page define
their first adjustment, which we call the stochasticity adjustment because the $\mathbf{0}^T$ rows of
H are replaced with $\frac{1}{n}\mathbf{e}^T$, thereby making H stochastic. As a result, the random surfer,
after entering a dangling node, can now hyperlink to any page at random. For the tiny
6-node web of Figure 4.1, the stochastic matrix called S is


$$
\mathbf{S} = \begin{pmatrix}
0 & 1/2 & 1/2 & 0 & 0 & 0 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 1/2 & 0 & 1/2 \\
0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}.
$$


² Almost 100 years ago, in 1906, Andrei Andreyevich Markov invented the chains that after 1926 bore his name
[20].


Writing this stochasticity fix mathematically reveals that S is created from a rank-one
update to H. That is, $\mathbf{S} = \mathbf{H} + \mathbf{a}\left(\frac{1}{n}\mathbf{e}^T\right)$, where $a_i = 1$ if page $i$ is a dangling
node and 0 otherwise. The binary vector a is called the dangling node vector. S is a
combination of the raw original hyperlink matrix H and a rank-one matrix $\frac{1}{n}\mathbf{a}\mathbf{e}^T$.
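The rank-one update can be tried on the 6-node web with a few lines of Python. This is only a minimal sketch: the function names `dangling_vector` and `stochastic_fix` are illustrative choices rather than names from the text, and exact fractions are used purely so the row sums can be checked without rounding.

```python
from fractions import Fraction as F

# Raw hyperlink matrix H for the 6-node web of Figure 4.1 (page 2 is dangling).
H = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]

def dangling_vector(H):
    """Return a, with a[i] = 1 if row i of H is all zeros and 0 otherwise."""
    return [1 if sum(row) == 0 else 0 for row in H]

def stochastic_fix(H):
    """Return S = H + a (1/n) e^T: every zero row becomes the uniform row."""
    n = len(H)
    a = dangling_vector(H)
    return [[H[i][j] + a[i] * F(1, n) for j in range(n)] for i in range(n)]

a = dangling_vector(H)
S = stochastic_fix(H)
print("a =", a)
print("row sums of S =", [str(sum(row)) for row in S])
```

Only the rows flagged by a change; every other row of S is identical to the corresponding row of H.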


This adjustment guarantees that S is stochastic, and thus, is the transition probability 
matrix for a Markov chain. However, it alone cannot guarantee the convergence results 
desired. (That is, that a unique positive $\boldsymbol{\pi}^T$ exists and that equation (4.2.1) will converge
to this $\boldsymbol{\pi}^T$ quickly.) Brin and Page needed another adjustment, this time a primitivity
adjustment. With this adjustment, the resulting matrix is stochastic and primitive. A
primitive matrix is both irreducible and aperiodic. Thus, the stationary vector of the chain 
(which is the PageRank vector in this case) exists, is unique, and can be found by a simple 
power iteration. Brin and Page once again use the random surfer to describe these Markov 
properties. 


The random surfer argument for the primitivity adjustment goes like this. While it 
is true that surfers follow the hyperlink structure of the Web, at times they get bored and 
abandon the hyperlink method of surfing by entering a new destination in the browser’s 
URL line. When this happens, the random surfer, like a Star Trek character, “teleports” to 
the new page, where he begins hyperlink surfing again, until the next teleportation, and so 
on. To model this activity mathematically, Brin and Page invented a new matrix G, such 
that 


$$
\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\tfrac{1}{n}\mathbf{e}\mathbf{e}^T,
$$


where $\alpha$ is a scalar between 0 and 1. G is called the Google matrix. In this model, $\alpha$ is a
parameter that controls the proportion of time the random surfer follows the hyperlinks as
opposed to teleporting. Suppose $\alpha = .6$. Then 60% of the time the random surfer follows
the hyperlink structure of the Web and the other 40% of the time he teleports to a random
new page. The teleporting is random because the teleportation matrix $\mathbf{E} = \frac{1}{n}\mathbf{e}\mathbf{e}^T$ is
uniform, meaning the surfer is equally likely, when teleporting, to jump to any page.
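To make the 60/40 split concrete, the following Python sketch forms the Google matrix for the 6-node web with the scaling parameter set to .6. The helper name `google_matrix` is a hypothetical choice for this illustration, and the dense construction is only sensible for a toy example of this size.

```python
from fractions import Fraction as F

def google_matrix(S, alpha):
    """Return G = alpha*S + (1 - alpha)*(1/n) e e^T as a dense list of rows."""
    n = len(S)
    teleport = (1 - alpha) * F(1, n)  # each entry of (1 - alpha) * E
    return [[alpha * S[i][j] + teleport for j in range(n)] for i in range(n)]

# Stochastic matrix S for the 6-node web of Figure 4.1.
sixth = F(1, 6)
S = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [sixth] * 6,
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]

alpha = F(6, 10)
G = google_matrix(S, alpha)

# From page 6 the surfer follows its single link to page 4 with probability
# alpha and otherwise teleports uniformly, possibly landing on page 4 anyway.
print("P(6 -> 4) =", G[5][3])
print("P(6 -> 1) =", G[5][0])
```

Every entry of G is at least (1 - .6)/6, which is exactly the share contributed by teleportation.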


There are several consequences of the primitivity adjustment. 


• G is stochastic. It is the convex combination of the two stochastic matrices S and
$\mathbf{E} = \frac{1}{n}\mathbf{e}\mathbf{e}^T$.


• G is irreducible. Every page is directly connected to every other page, so irreducibility
is trivially enforced.


• G is aperiodic. The self-loops ($G_{ii} > 0$ for all $i$) create aperiodicity.


• G is primitive because $\mathbf{G}^k > 0$ for some $k$. (In fact, this holds for $k = 1$.) This
implies that a unique positive $\boldsymbol{\pi}^T$ exists, and the power method applied to G is
guaranteed to converge to this vector.


• G is completely dense, which is a very bad thing, computationally. Fortunately, G
can be written as a rank-one update to the very sparse hyperlink matrix H. This is
computationally advantageous, as we show later in section 4.6.

$$
\begin{aligned}
\mathbf{G} &= \alpha\mathbf{S} + (1-\alpha)\tfrac{1}{n}\mathbf{e}\mathbf{e}^T \\
&= \alpha\left(\mathbf{H} + \tfrac{1}{n}\mathbf{a}\mathbf{e}^T\right) + (1-\alpha)\tfrac{1}{n}\mathbf{e}\mathbf{e}^T \\
&= \alpha\mathbf{H} + \left(\alpha\mathbf{a} + (1-\alpha)\mathbf{e}\right)\tfrac{1}{n}\mathbf{e}^T.
\end{aligned}
$$


• G is artificial in the sense that the raw hyperlink matrix H has been twice modified
in order to produce desirable convergence properties. A stationary vector (thus,
a PageRank vector) does not exist for H, so Brin and Page creatively cheated to
achieve their desired result. For the twice-modified G, a unique PageRank vector
exists, and as it turns out, this vector is remarkably good at giving a global importance
value to webpages.
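The rank-one form of G can be checked numerically on the 6-node web. The sketch below, with variable names chosen only for illustration, builds G once through S and once directly from H, a, and e, and compares the two in exact arithmetic for a scaling parameter of .9.

```python
from fractions import Fraction as F

H = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [F(1, 3), F(1, 3), 0, 0, F(1, 3), 0],
    [0, 0, 0, 0, F(1, 2), F(1, 2)],
    [0, 0, 0, F(1, 2), 0, F(1, 2)],
    [0, 0, 0, 1, 0, 0],
]
n = len(H)
alpha = F(9, 10)
a = [1 if sum(row) == 0 else 0 for row in H]  # dangling node vector

# Form 1: G = alpha*S + (1 - alpha)*(1/n) e e^T, with S = H + a (1/n) e^T.
S = [[H[i][j] + a[i] * F(1, n) for j in range(n)] for i in range(n)]
G_via_S = [[alpha * S[i][j] + (1 - alpha) * F(1, n) for j in range(n)]
           for i in range(n)]

# Form 2: G = alpha*H + (alpha*a + (1 - alpha)*e) (1/n) e^T.
w = [alpha * a[i] + (1 - alpha) for i in range(n)]  # the column alpha*a + (1 - alpha)*e
G_rank_one = [[alpha * H[i][j] + w[i] * F(1, n) for j in range(n)]
              for i in range(n)]

forms_agree = G_via_S == G_rank_one
all_positive = all(x > 0 for row in G_via_S for x in row)
print(forms_agree, all_positive)
```

Because every entry is positive, the primitivity condition already holds for the first power of G, as the list above notes.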


Notation for the PageRank Problem


H      very sparse, raw substochastic hyperlink matrix
S      sparse, stochastic, most likely reducible matrix
G      completely dense, stochastic, primitive matrix called the Google Matrix
E      completely dense, rank-one teleportation matrix
n      number of pages in the engine’s index = order of H, S, G, E
$\alpha$      scaling parameter between 0 and 1
$\boldsymbol{\pi}^T$      stationary row vector of G called the PageRank vector
a      binary dangling node vector


In summary, Google’s adjusted PageRank method is 


$$
\boldsymbol{\pi}^{(k+1)T} = \boldsymbol{\pi}^{(k)T}\mathbf{G}, \qquad (4.5.1)
$$


which is simply the power method applied to G. 


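We close this section with an example. Returning again to Figure 4.1, for $\alpha = .9$,
the stochastic, primitive matrix G is

$$
\mathbf{G} = .9\,\mathbf{H} + \left( .9 \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
+ .1 \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} \right)
\tfrac{1}{6} \begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}
= \begin{pmatrix}
1/60 & 7/15 & 7/15 & 1/60 & 1/60 & 1/60 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
19/60 & 19/60 & 1/60 & 1/60 & 19/60 & 1/60 \\
1/60 & 1/60 & 1/60 & 1/60 & 7/15 & 7/15 \\
1/60 & 1/60 & 1/60 & 7/15 & 1/60 & 7/15 \\
1/60 & 1/60 & 1/60 & 11/12 & 1/60 & 1/60
\end{pmatrix}.
$$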


Google’s PageRank vector is the stationary vector of G and is given by 


$$
\begin{array}{ccccccc}
 & 1 & 2 & 3 & 4 & 5 & 6 \\
\boldsymbol{\pi}^T = \big( & .03721 & .05396 & .04151 & .3751 & .206 & .2862 \big).
\end{array}
$$


The interpretation of $\pi_1 = .03721$ is that 3.721% of the time the random surfer visits
page 1. Therefore, the pages in this tiny web can be ranked by their importance as
(4 6 5 2 3 1), meaning page 4 is the most important page and page 1 is the least
important page, according to the PageRank definition of importance.
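The example can be reproduced with a direct power iteration on the dense 6 × 6 matrix. The sketch below is one possible way to do it; the function name `pagerank_dense`, the uniform starting vector, and the 1-norm stopping tolerance are assumptions made for this illustration.

```python
def pagerank_dense(G, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi G from the uniform vector until the 1-norm change is below tol."""
    n = len(G)
    pi = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        new = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        residual = sum(abs(x - y) for x, y in zip(new, pi))
        pi = new
        if residual < tol:
            return pi, k
    return pi, max_iter

# Google matrix for the web of Figure 4.1 with scaling parameter .9.
G = [
    [1/60, 7/15, 7/15, 1/60, 1/60, 1/60],
    [1/6, 1/6, 1/6, 1/6, 1/6, 1/6],
    [19/60, 19/60, 1/60, 1/60, 19/60, 1/60],
    [1/60, 1/60, 1/60, 1/60, 7/15, 7/15],
    [1/60, 1/60, 1/60, 7/15, 1/60, 7/15],
    [1/60, 1/60, 1/60, 11/12, 1/60, 1/60],
]

pi, iterations = pagerank_dense(G)
# Pages are numbered 1..6; sort them by decreasing PageRank.
ranking = sorted(range(1, 7), key=lambda page: -pi[page - 1])
print([round(x, 5) for x in pi])
print(ranking)
```

The ordering of `ranking` corresponds to the importance ordering quoted in the text.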


4.6 COMPUTATION OF THE PAGERANK VECTOR 


The PageRank problem can be stated in two ways: 


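1. Solve the following eigenvector problem for $\boldsymbol{\pi}^T$.

$$
\boldsymbol{\pi}^T = \boldsymbol{\pi}^T\mathbf{G}, \qquad \boldsymbol{\pi}^T\mathbf{e} = 1.
$$


2. Solve the following linear homogeneous system for $\boldsymbol{\pi}^T$.

$$
\boldsymbol{\pi}^T(\mathbf{I} - \mathbf{G}) = \mathbf{0}^T, \qquad \boldsymbol{\pi}^T\mathbf{e} = 1.
$$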


In the first system, the goal is to find the normalized dominant left-hand eigenvector of G
corresponding to the dominant eigenvalue $\lambda_1 = 1$. (G is a stochastic matrix, so $\lambda_1 = 1$.)
In the second system, the goal is to find the normalized left-hand null vector of $\mathbf{I} - \mathbf{G}$.
Both systems are subject to the normalization equation $\boldsymbol{\pi}^T\mathbf{e} = 1$, which ensures that $\boldsymbol{\pi}^T$
is a probability vector. In the example in section 4.5, G is a 6 × 6 matrix, so we used
Matlab’s eig command to solve for $\boldsymbol{\pi}^T$, then normalized the result (by dividing the vector
by its sum) to get the PageRank vector. However, for a web-sized matrix like Google’s, this
will not do. Other more advanced and computationally efficient methods must be used. Of
course, $\boldsymbol{\pi}^T$ is the stationary vector of a Markov chain with transition matrix G, and much
research has been done on computing the stationary vector for a general Markov chain. See
William J. Stewart’s book Introduction to the Numerical Solution of Markov Chains [154],
which contains over a dozen methods for finding $\boldsymbol{\pi}^T$. However, the specific features of the
PageRank matrix G make one numerical method, the power method, the clear favorite. In
this section, we discuss the power method, which is the original method proposed by Brin
and Page for finding the PageRank vector. We describe other more advanced methods in
Chapter 9.
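For a matrix as small as the 6 × 6 example, the second formulation can also be solved exactly as a linear system. The sketch below is not the Matlab eig approach mentioned above; it swaps in plain Gauss-Jordan elimination over exact fractions, replacing one redundant equation with the normalization condition. The function name `stationary_exact` is hypothetical, and the approach assumes an irreducible stochastic matrix and is only practical for tiny matrices.

```python
from fractions import Fraction as F

def stationary_exact(G):
    """Solve pi^T (I - G) = 0^T together with pi^T e = 1 by Gauss-Jordan elimination."""
    n = len(G)
    # Equation j (for j = 0..n-2): sum_i pi_i * (delta_ij - G[i][j]) = 0.
    A = [[(1 if i == j else 0) - G[i][j] for i in range(n)] + [F(0)]
         for j in range(n - 1)]
    # The last (redundant) equation is replaced by the normalization sum_i pi_i = 1.
    A.append([F(1)] * n + [F(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [x / pivot_value for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

# Google matrix of the section 4.5 example (scaling parameter .9), as exact fractions.
G = [
    [F(1, 60), F(7, 15), F(7, 15), F(1, 60), F(1, 60), F(1, 60)],
    [F(1, 6)] * 6,
    [F(19, 60), F(19, 60), F(1, 60), F(1, 60), F(19, 60), F(1, 60)],
    [F(1, 60), F(1, 60), F(1, 60), F(1, 60), F(7, 15), F(7, 15)],
    [F(1, 60), F(1, 60), F(1, 60), F(7, 15), F(1, 60), F(7, 15)],
    [F(1, 60), F(1, 60), F(1, 60), F(11, 12), F(1, 60), F(1, 60)],
]

pi = stationary_exact(G)
print([round(float(x), 5) for x in pi])
```

Elimination of this kind costs on the order of n cubed operations and fills in the matrix, which is why it is of no use at web scale.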


The World’s Largest Matrix Computation 


Cleve Moler, the founder of Matlab, wrote an article [131] for his October 2002 
newsletter Matlab News that cited PageRank as “The World’s Largest Matrix 
Computation.” Then Google was applying the power method to a sparse matrix 
of order 2.7 billion. Now it’s up to 8.1 billion! 


The power method is one of the oldest and simplest iterative methods for finding 
the dominant eigenvalue and eigenvector of a matrix.³ Therefore, it can be used to find
the stationary vector of a Markov chain. (The stationary vector is simply the dominant 
left-hand eigenvector of the Markov matrix.) However, the power method is known for 
its tortoise-like speed. Of the available iterative methods (Gauss-Seidel, Jacobi, restarted 
GMRES, BICGSTAB, etc. [18]), the power method is generally the slowest. So why did 
Brin and Page choose a method known for its sluggishness? There are several good reasons 
for their choice. 


First, the power method is simple. The implementation and programming are ele- 
mentary. (See the box on page 42 for a Matlab implementation of the PageRank power 
method.) In addition, the power method applied to G (equation (4.5.1)) can actually be 
expressed in terms of the very sparse H. 


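$$
\begin{aligned}
\boldsymbol{\pi}^{(k+1)T} &= \boldsymbol{\pi}^{(k)T}\mathbf{G} \\
&= \alpha\,\boldsymbol{\pi}^{(k)T}\mathbf{S} + \frac{1-\alpha}{n}\,\boldsymbol{\pi}^{(k)T}\mathbf{e}\,\mathbf{e}^T \\
&= \alpha\,\boldsymbol{\pi}^{(k)T}\mathbf{H} + \left(\alpha\,\boldsymbol{\pi}^{(k)T}\mathbf{a} + 1 - \alpha\right)\mathbf{e}^T/n. \qquad (4.6.1)
\end{aligned}
$$

The vector-matrix multiplications ($\boldsymbol{\pi}^{(k)T}\mathbf{H}$) are executed on the extremely sparse H, and
S and G are never formed or stored; only their rank-one components, a and e, are needed.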
Recall that each vector-matrix multiplication is O(n) since H has about 10 nonzeros per 
row. This is probably the main reason for Brin and Page’s use of the power method in 
1998. But why is the power method still the predominant method in PageRank research 
papers today, and why have most improvements been novel modifications to the PageRank 
power method, rather than experiments with other methods? The other advantages of the 
PageRank power method answer these questions. 
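A sketch of equation (4.6.1) in Python shows how the iteration can run on the sparse link structure alone. The adjacency-list representation, the function name `pagerank_sparse`, and the uniform outlink weights are assumptions made for this illustration; neither S nor G is ever built.

```python
def pagerank_sparse(outlinks, n, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method of equation (4.6.1) on an adjacency-list form of H.

    outlinks[i] lists the pages (0-based) that page i links to; an empty list
    marks a dangling node. Each outlink of page i carries weight 1/len(outlinks[i]).
    """
    pi = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        new = [0.0] * n
        dangling_mass = 0.0  # accumulates pi^T a
        for i, targets in enumerate(outlinks):
            if targets:
                share = alpha * pi[i] / len(targets)
                for j in targets:
                    new[j] += share  # contribution to alpha * pi^T H
            else:
                dangling_mass += pi[i]
        # Rank-one part: (alpha * pi^T a + 1 - alpha) * e^T / n.
        shift = (alpha * dangling_mass + 1.0 - alpha) / n
        new = [x + shift for x in new]
        residual = sum(abs(x - y) for x, y in zip(new, pi))
        pi = new
        if residual < tol:
            return pi, k
    return pi, max_iter

# Link structure of the 6-node web of Figure 4.1 (0-based page indices).
outlinks = [[1, 2], [], [0, 1, 4], [4, 5], [3, 5], [3]]
pi, iterations = pagerank_sparse(outlinks, 6, alpha=0.9)
print([round(x, 5) for x in pi], iterations)
```

Each pass touches every stored link once and every page once, so the work per iteration grows with the number of nonzeros in H rather than with n squared.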


The power method, like many other iterative methods, is matrix-free, which is a term
that refers to the storage and handling of the coefficient matrix. For matrix-free methods,
the coefficient matrix is only accessed through the vector-matrix multiplication routine. No
manipulation of the matrix is done. Contrast this with direct methods, which manipulate
elements of the matrix during each step. Modifying and storing elements of the Google
matrix is not feasible. Even though H is very sparse, its enormous size and lack of structure
preclude the use of direct methods. Instead, matrix-free methods, such as the class of
iterative methods, are preferred.


³ The power method goes back at least to 1913. With the help of James H. Wilkinson, the power method
became the standard method in the 1960s for finding the eigenvalues and eigenvectors of a matrix with a digital
computer [152, p. 69-70].


The power method is also storage-friendly. In addition to the sparse matrix H and
the dangling node vector a, only one vector, the current iterate $\boldsymbol{\pi}^{(k)T}$, must be stored.
This vector is completely dense, meaning n real numbers must be stored. For Google,
n = 8.1 billion, so one can understand their frugal mentality when it comes to storage.
Other iterative methods, such as GMRES or BICGSTAB, while faster, require the storage
of multiple vectors. For example, a restarted GMRES(10) requires the storage of 10 vectors
of length n at each iteration, which is equivalent to the amount of storage required by the
entire H matrix, since nnz(H) ≈ 10n.


The last reason for using the power method to compute the PageRank vector con- 
cerns the number of iterations it requires. Brin and Page reported in their 1998 papers, 
and others have confirmed, that only 50-100 power iterations are needed before the iterates 
have converged, giving a satisfactory approximation to the exact PageRank vector. Recall 
that each iteration of the power method requires O(n) effort because H is so sparse. As a 
result, it’s hard to find a method that can beat 50 O(n) power iterations. Algorithms whose 
run time and computational effort are linear (or sublinear) in the problem size are very fast, 
and rare. 


The next logical question is: why does the power method applied to G require only 
about 50 iterations to converge? Is there something about the structure of G that indicates 
this speedy convergence? The answer comes from the theory of Markov chains. In general, 
the asymptotic rate of convergence of the power method applied to a matrix depends on the 
ratio of the two eigenvalues that are largest in magnitude, denoted $\lambda_1$ and $\lambda_2$. Precisely,
the asymptotic convergence rate is the rate at which $|\lambda_2/\lambda_1|^k \to 0$. For stochastic matrices
such as G, $\lambda_1 = 1$, so $|\lambda_2|$ governs the convergence. Since G is also primitive, $|\lambda_2| < 1$.
In general, numerically finding $\lambda_2$ for a matrix requires computational effort that one is not
willing to spend just to get an estimate of the asymptotic rate of convergence. Fortunately,
for the PageRank problem, it’s easy to show [127, p. 502], [90, 108] that if the respective
spectra are $\sigma(\mathbf{S}) = \{1, \mu_2, \ldots, \mu_n\}$ and $\sigma(\mathbf{G}) = \{1, \lambda_2, \ldots, \lambda_n\}$, then

$$
\lambda_k = \alpha\mu_k \quad \text{for } k = 2, 3, \ldots, n.
$$

(A short proof of this statement is provided at the end of this chapter.) Furthermore, the link
structure of the Web makes it very likely that $|\mu_2| = 1$ (or at least $|\mu_2| \approx 1$), which means
that $|\lambda_2(\mathbf{G})| = \alpha$ (or $|\lambda_2(\mathbf{G})| \approx \alpha$). As a result, the convex combination parameter $\alpha$
explains the reported convergence after just 50 iterations. In their papers, Google founders
Brin and Page use $\alpha = .85$, and at last report, this is still the value used by Google.
$\alpha^{50} = .85^{50} \approx .000296$, which implies that at the 50th iteration one can expect roughly
2-3 places of accuracy in the approximate PageRank vector. This degree of accuracy is
apparently adequate for Google’s ranking needs. Mathematically, ten places of accuracy 
may be needed to distinguish between elements of the PageRank vector (see Section 8.3), 
but when PageRank scores are combined with content scores, high accuracy may be less 
important.


Subdominant Eigenvalue of the Google Matrix


For the Google matrix $\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\frac{1}{n}\mathbf{e}\mathbf{e}^T$,

$$
|\lambda_2(\mathbf{G})| \le \alpha.
$$

• For the case when $|\lambda_2(\mathbf{S})| = 1$ (which occurs often due to the reducibility of
the web graph), $|\lambda_2(\mathbf{G})| = \alpha$. Therefore, the asymptotic rate of convergence of
the PageRank power method of equation (4.6.1) is the rate at which $\alpha^k \to 0$.
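The arithmetic behind this rate can be written out as a short worked calculation. It is only a sketch, assuming the error factor after $k$ iterations behaves like $\alpha^k$; the symbol $\tau$ for the number of decimal digits gained is introduced here for the illustration.

```latex
\alpha^{k} \le 10^{-\tau}
\;\iff\;
k \ge \frac{-\tau}{\log_{10}\alpha},
\qquad\text{and for } \alpha = .85,\ k = 50:\quad
.85^{50} \approx 2.96 \times 10^{-4},
\qquad
-50\,\log_{10}(.85) \approx 3.5 .
```

Under that assumption, 50 iterations shrink the error factor by roughly three and a half orders of magnitude, which is in line with the 2-3 places of accuracy reported for the approximate PageRank vector.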


We can now give positive answers to the six questions of section 4.3. With the
stochasticity and primitivity adjustments, the power method applied to G is guaranteed to
converge to a unique positive vector called the PageRank vector, regardless of the starting
vector. Because the resulting PageRank vector is positive, there are no undesirable ties
at 0. Further, to produce PageRank scores with approximately $\tau$ digits of accuracy, about
$-\tau/\log_{10}\alpha$ iterations must be completed.


Matlab m-file for PageRank Power Method 


This m-file is a Matlab implementation of the PageRank power method given in 
equation (4.6.1). 


function [pi,time,numiter]=PageRank(pi0,H,n,alpha,epsilon);

% PageRank computes the PageRank vector for an n-by-n Markov
%          matrix H with starting vector pi0 (a row vector)
%          and scaling parameter alpha (scalar). Uses power
%          method.
%
% EXAMPLE: [pi,time,numiter]=PageRank(pi0,H,1000,.9,1e-8);
%
% INPUT:   pi0 = starting vector at iteration 0 (a row vector)
%          H = row-normalized hyperlink matrix (n-by-n sparse matrix)
%          n = size of H matrix (scalar)
%          alpha = scaling parameter in PageRank model (scalar)
%          epsilon = convergence tolerance (scalar, e.g. 1e-8)
%
% OUTPUT:  pi = PageRank vector
%          time = time required to compute PageRank vector
%          numiter = number of iterations until convergence
%
% The starting vector is usually set to the uniform vector,
% pi0=1/n*ones(1,n).
%
% NOTE: Matlab stores sparse matrices by columns, so it is faster
%       to do some operations on H', the transpose of H.

% get "a", the dangling node vector, where a(i)=1, if node i
% is dangling node and 0, o.w.
rowsumvector=ones(1,n)*H';
nonzerorows=find(rowsumvector);
zerorows=setdiff(1:n,nonzerorows); l=length(zerorows);
a=sparse(zerorows,ones(l,1),ones(l,1),n,1);

k=0;
residual=1;
pi=pi0;
tic;

% iterate equation (4.6.1) until the 1-norm change drops below epsilon
while (residual >= epsilon)
    prevpi=pi;
    k=k+1;
    pi=alpha*pi*H + (alpha*(pi*a)+1-alpha)*((1/n)*ones(1,n));
    residual=norm(pi-prevpi,1);
end
numiter=k;
time=toc;


Search within a Site 


In the competitive business of search, Google is refreshingly generous at times. 
For example, at no charge, Google lets website authors employ its technology to 
search within their site. (Clicking on the “more” button on Google’s home page 
will lead you to the latest information on their services.) For queries within a 
site, Google restricts the set of relevant pages to only in-site pages. These in-site 
relevant pages are then ranked using the global PageRank scores. In essence, this 
in-site search extracts the site from Google’s massive index of billions of pages 
and untangles the part of the Web pertaining to the site. Looking at an individual 
subweb makes for a much more manageable hyperlink graph. 


ASIDE: Search Engine Optimization 


As more and more sales move online, large and small businesses alike turn to search 
engine optimizers (SEOs) to help them boost profits. SEOs carefully craft webpages and links 
in order to “optimize” the chances that their clients’ pages will appear in the first few pages of 
search engine results. SEOs can be classified as ethical or unethical. Ethical SEOs are good 
netizens, citizens of the net, who offer only sound advice, such as the best way to display text 
and label pictures and tags. They encourage webpage authors to maintain good content, as 
page rankings are the combination of the content score and the popularity score. They also 
warn authors that search engines punish pages they perceive as deliberately spamming. Ethical 
SEOs and search engines consider themselves partners who, by exchanging information and 
tips, together improve search quality. Unethical SEOs, on the other hand, intentionally try 
to outwit search engines and promote spamming techniques. See the aside on page 52 for a 
specific case of unethical SEO practices. Since the Web’s infancy, search engines have been
embroiled in an eternal battle with unethical SEOs. The battle rages all over the Web, from
visible webpage content to hidden metatags, from links to anchor text, and from inside servers 
to out on link farms (again, see the aside on page 52). 


SEOs had success against the early search engines by using term spamming and hiding 
techniques [84]. In term spamming, spam words are included in the body of the page, often 
times repeatedly, in the title, metatags, anchor text, and URL text. Hiding techniques use 
color schemes and cloaking to deceive search engines. For example, using white text on a 
white background makes spam invisible to human readers, which means search engines are 
less likely to receive helpful complaints about pages with hidden spam. Cloaking refers to the 
technique of returning one spam-loaded webpage for normal user requests and another spam- 
free page for requests from search engine crawlers. As long as authors can clearly identify web 
crawling agents, they can send the agent away with a clean, spam-free page. Because these 
techniques are so easy for webpage authors to use, search engines had to retaliate. They did 
so by increasing the IQ of their spiders and indexers. Many spiders and indexers are trained 
to ignore metatags, since by the late 1990s these rarely held accurate page information. They 
also ignore repeated keywords. However, cloaking is harder to counteract. Search engines 
request help from users to stop cloaking. For example, Google asks surfers to act as referees 
and to blow the whistle whenever they find a suspicious page that instantaneously redirects 
them to a new page. 


In 1998, search engines added link analysis to their bag of tricks. As a result, content 
spam and cloaking alone could no longer fool the link analysis engines and garner spam- 
mers unjustifiably high rankings. Spammers and SEOs adapted by learning how link analysis 
works. The SEO community has always been active—its members, then and now, hold con- 
ferences, write papers and books, host weblogs, and sell their secrets. The most famous and 
informative SEO papers were written by Chris Ridings, “PageRank explained: Everything 
you’ve always wanted to know about PageRank” [143] and “PageRank uncovered” [144]. 
These papers offer practical strategies for hoarding PageRank and avoiding such undesirable 
things as PageRank leak. Search engines constantly tune their algorithms in order to stay one 
step ahead of the SEO gamers. While search engines consider unethical SEOs to be adver- 
saries, some web analysts call them an essential part of the web food chain, because they drive 
innovation and research and development. 


ASIDE: How Do Search Engines Make Money? 


We are asked this question often. It’s a good question. Search engines provide free and 
unlimited access to their services, so just where do the billions of dollars in search revenue 
come from? Search engines have multiple sources of income. First, there’s the inclusion fee 
that some search engines charge website authors. Some impatient authors want a guarantee 
that their new site will be indexed soon (in a day or two) rather than in a month or two, when 
a spider finally gets to it in the to-be-crawled URL list. Search engines supply this guarantee 
for a small fee, and for a slightly larger fee, authors can guarantee that their site be reindexed 
on a more frequent, perhaps monthly, basis. 


Most search engines also generate revenue by selling profile data to interested parties. 
Search engines collect enormous amounts of user data on a daily basis. These data are used to
improve the quality of search and predict user needs, but they are also sold in an aggregate form to
various companies. For example, search engine optimization companies who are interested in 
popular query words or the percentage of searches that are commercial in nature can buy this 
information directly from a search engine.


While search engines do not sell access to their search capabilities to individual users,
they do sell search services to companies. For example, Netscape pays Google to use Google 
search as the default search provided by its browser. At one point, GoTo (which was bought 
by Overture, which is now part of Yahoo) sold its top seven results for each query term to 
Yahoo and AltaVista, who, in turn, used the seven results as their top results. 


Despite these sources of income, by far the most profitable and fastest-growing revenue 
source for search engines is advertising. It is estimated that in 2004 $3 billion in search 
revenue will be generated from advertising. Google’s IPO filing on June 21, 2004 made the 
company’s dependence on advertising very clear: advertising accounted for over 97% of their 
2003 revenue. Many search engines sell banner ads that appear on their homepages and results 
pages. Others sell pay-for-placement ads. These controversial ads allow a company to buy 
their way to the top of the ranking. Many web analysts argue that these pay-for-placement 
ads pollute the search results. However, search engines using this technique (GoTo is a prime 
example) retort that this method of ranking is excellent for commercial searches. Since recent 
surveys estimate that 15-30% of all searches are commercial in nature, engines like Overture 
provide a valuable service for this class of queries. On the other hand, many searches are 
research-oriented, and the results of pay-for-placement engines frustrate these users. 


Google takes a different approach to advertisements and rankings. They present the 
unpaid results in a main list while pay-for-placement sites appear separately on the side as 
“sponsored links.” Google, and now Yahoo, are the only remaining companies not to mingle 
paid links with pure links. Google uses a cost-per-click advertising scheme to present spon- 
sored links. Companies choose a keyword associated with their product or service, and then 
bid on a price they are willing to pay each time a searcher clicks on their link. For example, 
a bike shop in Raleigh may bid 5 cents for every query on “bike Raleigh.” The bike shop is 
billed only if a searcher actually clicks on their ad. However, another company may bid 17 
cents for the same query. The ad for the second company is likely to appear first because, al- 
though there is some fine tuning and optimization, sponsored ads generally are listed in order 
from the highest bid to the lowest bid. 


Cost-per-click advertising is an innovation in marketing. Small businesses who tradi- 
tionally spent little on advertising are now spending much more on web advertising because 
cost-per-click advertising is so cost-effective. If a searcher clicks on the link, he or she is 
indicating an intent to buy, something that other means of advertising such as billboards or 
mail circulars cannot deliver. Interestingly, like many other things on the Web, it was only a 
matter of time before cost-per-click advertising turned into a battleground between competi- 
tors. Without protection (which can be purchased in the form of a software program) naive 
companies buying cost-per-click advertising can easily be sabotaged by competitors. Com- 
petitors repeatedly click on the naive company’s ads, running up their tab and exhausting the 
company’s advertising budget. 


4.7 THEOREM AND PROOF FOR SPECTRUM OF THE GOOGLE MATRIX 


In this chapter, we defined the Google matrix as $\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\frac{1}{n}\mathbf{e}\mathbf{e}^T$. However,
in Section 5.3 of the next chapter, we broaden this to include a more general Google
matrix, where the fudge factor matrix E changes from the uniform matrix $\frac{1}{n}\mathbf{e}\mathbf{e}^T$ to $\mathbf{e}\mathbf{v}^T$,
where $\mathbf{v}^T > 0$ is a probability vector. In this section, we present the theorem and proof for
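the second eigenvalue of this more general Google matrix.


Theorem 4.7.1. If the spectrum of the stochastic matrix S is $\{1, \lambda_2, \lambda_3, \ldots, \lambda_n\}$, then
the spectrum of the Google matrix $\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T$ is $\{1, \alpha\lambda_2, \alpha\lambda_3, \ldots, \alpha\lambda_n\}$,
where $\mathbf{v}^T$ is a probability vector.


Proof. Since S is stochastic, $(1, \mathbf{e})$ is an eigenpair of S. Let $\mathbf{Q} = (\mathbf{e} \;\; \mathbf{X})$ be a nonsingular
matrix that has the eigenvector e as its first column. Let
$\mathbf{Q}^{-1} = \begin{pmatrix} \mathbf{y}^T \\ \mathbf{Y}^T \end{pmatrix}$. Then

$$
\mathbf{Q}^{-1}\mathbf{Q} = \begin{pmatrix} \mathbf{y}^T\mathbf{e} & \mathbf{y}^T\mathbf{X} \\ \mathbf{Y}^T\mathbf{e} & \mathbf{Y}^T\mathbf{X} \end{pmatrix}
= \begin{pmatrix} 1 & \mathbf{0} \\ \mathbf{0} & \mathbf{I} \end{pmatrix},
$$

which gives two useful identities, $\mathbf{y}^T\mathbf{e} = 1$ and $\mathbf{Y}^T\mathbf{e} = \mathbf{0}$. As a result, the similarity
transformation

$$
\mathbf{Q}^{-1}\mathbf{S}\mathbf{Q} = \begin{pmatrix} \mathbf{y}^T\mathbf{e} & \mathbf{y}^T\mathbf{S}\mathbf{X} \\ \mathbf{Y}^T\mathbf{e} & \mathbf{Y}^T\mathbf{S}\mathbf{X} \end{pmatrix}
= \begin{pmatrix} 1 & \mathbf{y}^T\mathbf{S}\mathbf{X} \\ \mathbf{0} & \mathbf{Y}^T\mathbf{S}\mathbf{X} \end{pmatrix}
$$

reveals that $\mathbf{Y}^T\mathbf{S}\mathbf{X}$ contains the remaining eigenvalues of S, $\lambda_2, \ldots, \lambda_n$. Applying the
similarity transformation to $\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T$ gives

$$
\begin{aligned}
\mathbf{Q}^{-1}\left(\alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T\right)\mathbf{Q}
&= \alpha\mathbf{Q}^{-1}\mathbf{S}\mathbf{Q} + (1-\alpha)\mathbf{Q}^{-1}\mathbf{e}\mathbf{v}^T\mathbf{Q} \\
&= \begin{pmatrix} \alpha & \alpha\mathbf{y}^T\mathbf{S}\mathbf{X} \\ \mathbf{0} & \alpha\mathbf{Y}^T\mathbf{S}\mathbf{X} \end{pmatrix}
 + (1-\alpha)\begin{pmatrix} \mathbf{y}^T\mathbf{e} \\ \mathbf{Y}^T\mathbf{e} \end{pmatrix}
 \begin{pmatrix} \mathbf{v}^T\mathbf{e} & \mathbf{v}^T\mathbf{X} \end{pmatrix} \\
&= \begin{pmatrix} \alpha & \alpha\mathbf{y}^T\mathbf{S}\mathbf{X} \\ \mathbf{0} & \alpha\mathbf{Y}^T\mathbf{S}\mathbf{X} \end{pmatrix}
 + \begin{pmatrix} (1-\alpha) & (1-\alpha)\mathbf{v}^T\mathbf{X} \\ \mathbf{0} & \mathbf{0} \end{pmatrix} \\
&= \begin{pmatrix} 1 & \alpha\mathbf{y}^T\mathbf{S}\mathbf{X} + (1-\alpha)\mathbf{v}^T\mathbf{X} \\ \mathbf{0} & \alpha\mathbf{Y}^T\mathbf{S}\mathbf{X} \end{pmatrix}.
\end{aligned}
$$

Therefore, the eigenvalues of $\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T$ are $\{1, \alpha\lambda_2, \alpha\lambda_3, \ldots, \alpha\lambda_n\}$. ∎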
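A two-page example makes the statement easy to check by hand. The matrices below are an illustration chosen for this purpose, not an example from the text: S is the 2 × 2 stochastic matrix in which each page links only to the other, and the teleportation vector is taken to be uniform.

```latex
\mathbf{S} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\quad \sigma(\mathbf{S}) = \{1, -1\},
\qquad
\mathbf{v}^T = \begin{pmatrix} \tfrac12 & \tfrac12 \end{pmatrix},
\qquad
\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\,\mathbf{e}\mathbf{v}^T
= \begin{pmatrix} \frac{1-\alpha}{2} & \frac{1+\alpha}{2} \\[2pt] \frac{1+\alpha}{2} & \frac{1-\alpha}{2} \end{pmatrix}.
```

The trace of G is $1-\alpha$ and its determinant is $\frac{(1-\alpha)^2 - (1+\alpha)^2}{4} = -\alpha$, so its eigenvalues are $1$ and $-\alpha$, which is $\{1, \alpha\lambda_2\}$ with $\lambda_2 = -1$, in agreement with the theorem.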


Chapter Five 


Parameters in the PageRank Model 


My grandfather, William H. Langville, Sr., loved fiddling with projects in his basement 
workshop. Down there he had a production process for making his own shad darts for 
fishing. He poured lead into a special mold, let it cool, then applied bright paints. He 
manufactured those darts by the dozens, which was good because on each fishing trip my 
brothers, cousins, and I always lost at least three each to trees, underwater boots, poor 
knot-tying, and of course, really big, sharp-toothed fish. Grandpop kept meticulous fishing 
records of where, when, how many, and which type of fish he caught each day. He also 
noted the style of dart he’d used. He looked for success patterns. It wasn’t long before he 
started fiddling with his manufacturing process, making big darts, small darts, green darts, 
orange darts, two-toned darts, feathered darts, darts with sinks, and darts with spinners. He 
found the fiddling fun—hypothesizing, testing, and reporting what happened if he tweaked 
this parameter that way, that parameter this way. 


We agree with Grandpop. The fun is in the fiddling. In this chapter, we introduce the 
various methods for fiddling with the basic PageRank model of Chapter 4, and then, like 
Grandpop, consider the implications of such changes. 


5.1 THE α FACTOR


In Chapter 4, we introduced the scaling parameter $0 < \alpha < 1$ to create the Google matrix
$\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\mathbf{E}$. The constant $\alpha$ clearly controls the priority given to the Web’s natural
hyperlink structure as opposed to the artificial teleportation matrix E. In their early papers
[39, 40], Brin and Page, the founders of Google, suggest setting $\alpha = .85$. Like many
others, we wonder why .85? Why not .9? Or .95? Or .6? What effect does $\alpha$ have on the
PageRank problem? In Chapter 4, we mentioned that the scaling parameter controlled the
asymptotic rate of convergence of the PageRank power method. Reviewing the conclusion
there, as $\alpha \to 1$, the expected number of iterations required by the power method increases
dramatically. See Table 5.1 below.


For $\alpha = .5$, only about 34 iterations are expected before the power method has
converged to a tolerance of $10^{-10}$. As $\alpha \to 1$, this number becomes prohibitive. Even
with $\alpha = .85$, several days of computation are still required before satisfactory
convergence, due to the scale of the matrices and vectors involved. This means that
Google engineers are forced to perform a delicate balancing act: as $\alpha \to 1$, the artificiality
introduced by the teleportation matrix $\mathbf{E} = \frac{1}{n}\mathbf{e}\mathbf{e}^T$ is reduced, yet the computation time
increases.


Table 5.1 Effect of α on expected number of power iterations


α        Number of Iterations
.5       34
.75      81
.8       104
.85      142
.9       219
.95      449
.99      2,292
.999     23,015


It seems that setting $\alpha = .85$ strikes a workable compromise between efficiency and
effectiveness. Interestingly, this constant $\alpha$ controls more than just the convergence of the
PageRank method; it affects the sensitivity of the resulting PageRank vector. Specifically,
as $\alpha \to 1$, the PageRankings become much more volatile, and fluctuate noticeably for even
small changes in the Web’s structure. The Web’s dynamic nature, which we emphasized
in Chapter 1, makes sensitivity an important issue. Ideally, we’d like to produce a ranking
that is stable despite such small changes. The sensitivity issue, especially $\alpha$’s effect on it,
is treated in depth in the next chapter.
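The iteration counts in Table 5.1 are consistent with asking for the smallest number of iterations k at which the scaling parameter raised to the power k falls below the tolerance of 10^-10. The short Python sketch below recomputes them under that assumption; the function name `expected_iterations` is an illustrative choice.

```python
import math

def expected_iterations(alpha, tol=1e-10):
    """Return ceil(log(tol) / log(alpha)), the first k with alpha**k at or below tol."""
    return math.ceil(math.log(tol) / math.log(alpha))

alphas = [0.5, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99, 0.999]
table = {alpha: expected_iterations(alpha) for alpha in alphas}

for alpha, k in table.items():
    print(f"{alpha:<6} {k:>7,}")
```

The count grows roughly like 1/(1 - α) as α approaches 1, which is the cost side of the balancing act described above.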


5.2 THE HYPERLINK MATRIX H 


Another part of the PageRank model that can be adjusted is the H matrix itself. Brin and 
Page originally suggested a uniform weighting scheme for filling in elements in H. That 
is, all outlinks from a page are given equal weight in terms of the random surfer’s hy- 
perlinking probabilities. While fair, democratic, and easy to implement, equality may not 
be best for webpage rankings. In fact, the random surfer description may not be accurate 
at all. Rather than hyperlinking to new pages by randomly selecting one of the outlinking 
pages, perhaps surfers select new pages by choosing outlinking pages with a lot of valuable 
content or pertinent descriptive anchor text. (To understand the importance of anchor text, 
see the aside on Google bombs on page 54.) In this case, take the random surfer who plays 
eeni-meeni-meini-mo to decide which page to visit next and replace him with an intelli- 
gent surfer who upon arriving at a new page pulls a calculator from his chest pocket and 
pecks away until he decides which page is most appropriate to visit next (based on current 
location, interests, history, and so on). For example, the intelligent surfer may be more 
likely to jump to content-filled pages, so these pages should be given more probabilistic 
weight than brief advertisement pages. 


A practical approach to filling in H’s elements is to use access logs to find actual
surfer tendencies. For example, a webmaster can study his access logs and find that surfers
on page $P_1$ are twice as likely to hyperlink to $P_2$ as they are to $P_3$. Thus, outlinking
probabilities in row 1 of H can be adjusted accordingly. For the webgraph from Figure
4.1, the original hyperlink matrix using the random surfer description,

$$
\mathbf{H} = \begin{array}{c|cccccc}
 & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 \\ \hline
P_1 & 0 & 1/2 & 1/2 & 0 & 0 & 0 \\
P_2 & 0 & 0 & 0 & 0 & 0 & 0 \\
P_3 & 1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\
P_4 & 0 & 0 & 0 & 0 & 1/2 & 1/2 \\
P_5 & 0 & 0 & 0 & 1/2 & 0 & 1/2 \\
P_6 & 0 & 0 & 0 & 1 & 0 & 0
\end{array},
$$

changes to

$$
\mathbf{H} = \begin{array}{c|cccccc}
 & P_1 & P_2 & P_3 & P_4 & P_5 & P_6 \\ \hline
P_1 & 0 & 2/3 & 1/3 & 0 & 0 & 0 \\
P_2 & 0 & 0 & 0 & 0 & 0 & 0 \\
P_3 & 1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\
P_4 & 0 & 0 & 0 & 0 & 1/2 & 1/2 \\
P_5 & 0 & 0 & 0 & 1/2 & 0 & 1/2 \\
P_6 & 0 & 0 & 0 & 1 & 0 & 0
\end{array}
$$

when the intelligent surfer description is applied to page $P_1$.

PageRank researchers have presented many other methods for filling in the elements of the raw 
hyperlink matrix $\mathbf{H}$ [13, 26, 27, 142, 159]. These methods use heuristic rules to create the 
nonzero elements of $\mathbf{H}$ by combining measures concerning the location of the outlinks in 
a page, the length of the anchor text associated with the outlinks, and the content similarity 
between the two documents connected by a link. For example, row 4 of the above matrix 
shows that page $P_4$ links to pages $P_5$ and $P_6$. The probabilities $H_{45}$ and $H_{46}$ can be 
determined by computing the angle similarity measure between pages $P_4$ and $P_5$ and between 
$P_4$ and $P_6$, respectively. The angle similarity measure is an important part of a traditional 
information retrieval model, the vector space model of Chapter 1 [23]. Regardless of how 
$\mathbf{H}$ is created, it is important, in the context of the Markov chain, that the resulting matrix 
be nearly stochastic. That is, the rows corresponding to nondangling nodes (pages with 
at least one outlink) sum to 1, while rows for dangling nodes sum to 0. If this is not the 
case, the rows must be normalized. We will discuss other non-Markovian ranking models 
in Chapters 11 and 12. 
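
The normalization requirement just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `normalize_rows` and the raw click counts in row 1 are hypothetical, chosen so that the result reproduces the adjusted matrix for the graph of Figure 4.1.

```python
def normalize_rows(weights):
    """Scale each nonzero row to sum to 1; leave all-zero (dangling) rows at 0."""
    H = []
    for row in weights:
        if any(w < 0 for w in row):
            raise ValueError("link weights must be nonnegative")
        total = sum(row)
        H.append([w / total for w in row] if total > 0 else list(row))
    return H


# Hypothetical raw weights for the six-page graph of Figure 4.1.
# Row 1 holds made-up click counts (P_2 chosen twice as often as P_3);
# every other row simply holds a 1 for each outlink.
raw = [
    [0, 20, 10, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],   # P_2 is a dangling node
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
]
H = normalize_rows(raw)
```

Any heuristic weighting (anchor text, similarity scores, access logs) could feed the same routine, since only the row sums matter for the Markov chain.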


5.3 THE TELEPORTATION MATRIX E 


One of the first modifications to the basic PageRank model that founders Brin and Page 
suggested was a change to the teleportation matrix $\mathbf{E}$. Rather than using $\frac{1}{n}\mathbf{e}\mathbf{e}^T$, they 
used $\mathbf{e}\mathbf{v}^T$, where $\mathbf{v}^T > 0$ is a probability vector called the personalization or 
teleportation vector. Since $\mathbf{v}^T$ is a probability vector with positive elements, every node is still 
directly connected to every other node; thus, $\mathbf{G}$ is primitive, which means that a unique 
stationary vector for the Markov chain exists and is the PageRank vector. Using $\mathbf{v}^T$ in place 
of $\frac{1}{n}\mathbf{e}^T$ means that the teleportation probabilities are no longer uniformly distributed. 
Instead, each time a surfer teleports, he or she follows the probability distribution given in 
$\mathbf{v}^T$ to jump to the next page. This slight modification retains the advantageous properties 
of the power method. When $\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T$, the power method becomes 

$$
\begin{aligned}
\boldsymbol{\pi}^{(k+1)T} &= \boldsymbol{\pi}^{(k)T}\mathbf{G} \\
&= \alpha\,\boldsymbol{\pi}^{(k)T}\mathbf{S} + (1-\alpha)\,\boldsymbol{\pi}^{(k)T}\mathbf{e}\mathbf{v}^T \\
&= \alpha\,\boldsymbol{\pi}^{(k)T}\mathbf{H} + (\alpha\,\boldsymbol{\pi}^{(k)T}\mathbf{a} + 1 - \alpha)\,\mathbf{v}^T. \qquad (5.3.1)
\end{aligned}
$$

Compare equation (5.3.1) with equation (4.6.1) on page 40, which uses the original 
democratic teleportation matrix $\mathbf{E} = \frac{1}{n}\mathbf{e}\mathbf{e}^T$. Since only the constant vector added at each 
iteration changes from $\mathbf{e}^T/n$ to $\mathbf{v}^T$, nearly all our Chapter 4 discoveries concerning the 
PageRank power method still apply. Specifically, the asymptotic rate of convergence, 
sparse vector-matrix multiplications, minimal storage, and coding simplicity are preserved. 
However, one thing that does change is the PageRank vector itself. Different personalization 
vectors produce different PageRankings [158]. That is, $\boldsymbol{\pi}^T(\mathbf{v}^T)$ is a function of $\mathbf{v}^T$. 
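
To see the dependence on the personalization vector concretely, here is a minimal Python sketch of iteration (5.3.1) applied to the six-page graph of Figure 4.1. The function name `pagerank`, the choice $\alpha = .85$, the tolerance, and the biased vector are illustrative assumptions only; dense lists are used for readability, whereas a real implementation would exploit sparsity.

```python
def pagerank(H, v, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power method for G = alpha*S + (1 - alpha)*e*v^T, following (5.3.1)."""
    n = len(H)
    # The dangling-node vector a: a_i = 1 if row i of H is all zeros.
    dangling = [1.0 if sum(row) == 0 else 0.0 for row in H]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # First term: alpha * pi^T H
        new = [alpha * sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]
        # Second term: (alpha * pi^T a + 1 - alpha) * v^T
        scale = alpha * sum(p * d for p, d in zip(pi, dangling)) + 1.0 - alpha
        new = [x + scale * vj for x, vj in zip(new, v)]
        residual = sum(abs(x - y) for x, y in zip(new, pi))
        pi = new
        if residual < tol:
            break
    return pi


# Random-surfer hyperlink matrix for the graph of Figure 4.1.
H = [
    [0, 1/2, 1/2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1/3, 1/3, 0, 0, 1/3, 0],
    [0, 0, 0, 0, 1/2, 1/2],
    [0, 0, 0, 1/2, 0, 1/2],
    [0, 0, 0, 1, 0, 0],
]
uniform = [1/6] * 6
biased = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # hypothetical: favors page 1

pi_uniform = pagerank(H, uniform)
pi_biased = pagerank(H, biased)
```

Comparing `pi_uniform` with `pi_biased` shows page 1 gaining PageRank when the teleportation vector favors it, while the iteration itself is unchanged apart from the vector that is added at each step.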


Recognizing the uses of $\mathbf{v}^T$ is liberating. Think about it. Why should we all be 
subject to the same ranking of webpages? That single global, query-independent ranking 
$\boldsymbol{\pi}^T$ (which uses $\mathbf{v}^T = \frac{1}{n}\mathbf{e}^T$) says nothing about me and my preferences. As Americans, 
aren’t we all entitled to our own individual ranking vector, one that knows our personal 
preferences regarding pages and topics on the Web? If you like to surf for pages about news 
and current events, simply bias your $\mathbf{v}^T$ vector, so that $v_i$ is large for pages $P_i$ about news 
and current events and $v_i$ is nearly 0 for all other pages, and then compute the PageRank 
vector that’s tailored to your needs. Politicians can add another phrase to their campaign 
promises: “a car in every garage, a computer in every home, and a personalization vector 
$\mathbf{v}^T$ for every web surfer.” 


This seems to have been Google’s original intent in introducing the personalization 
vector [38]. However, it makes the once query-independent, user-independent PageRanks 
user-dependent and more calculation-laden. Tailoring rankings for each user sounds 
wonderful in theory, yet doing this in practice is computationally impossible. Remember, it 
takes Google days to compute just one $\boldsymbol{\pi}^T$ corresponding to one $\mathbf{v}^T$ vector, the democratic 
personalization vector $\mathbf{v}^T = \frac{1}{n}\mathbf{e}^T$. 


Motivated in part by the fact that many see personalized engines as the future of 
search, several researchers have ignored the claims of computational impossibility and 
have created pseudo-personalized PageRanking systems [58, 88, 91, 99, 142]. We say 
pseudo because these systems do not deliver rankings that are customized for each and 
every user, but rather groups of users. 


One such system was the product of Taher Haveliwala, while he was a graduate 
student at Stanford. He adapted the standard, query-independent PageRank to create a 
topic-sensitive PageRank [88, 89]. He created a finite number of PageRank vectors $\boldsymbol{\pi}^T(\mathbf{v}_i^T)$, 
each biased toward some particular topic $i$. For his experiments, Haveliwala chose the 16 
top-level topics from the Open Directory Project (ODP) classification of webpages. For 
example, suppose $\boldsymbol{\pi}^T(\mathbf{v}_1^T)$ is the PageRank vector for Arts, the first ODP topic, while 
$\boldsymbol{\pi}^T(\mathbf{v}_2^T)$ is the PageRank vector for Business, the second ODP topic. $\boldsymbol{\pi}^T(\mathbf{v}_1^T)$ is biased 
toward Arts because $\mathbf{v}_1^T$ has significant probabilities only for pages pertaining to Arts; the 
remaining probabilities are nearly 0. The 16 biased PageRank vectors are precomputed. 
Then at query time, the trick is to quickly combine these biased vectors in a way that 
mimics the interests of the user and meanings of the query. Haveliwala forms his 
topic-sensitive, query-dependent PageRank vector as a convex combination of the 16 biased 
PageRank vectors. That is, 

$$
\boldsymbol{\pi}^T = \beta_1\boldsymbol{\pi}^T(\mathbf{v}_1^T) + \beta_2\boldsymbol{\pi}^T(\mathbf{v}_2^T) + \cdots + \beta_{16}\boldsymbol{\pi}^T(\mathbf{v}_{16}^T),
$$

where $\sum_i \beta_i = 1$. For instance, a query on science project ideas falls between 
the ODP categories of Kids and Teens (category 7), Reference (category 10), and Science 
(category 12). Logically, the PageRank vectors associated with these topics should be 
given more weight, or even all the weight, so that $\beta_7$, $\beta_{10}$, and $\beta_{12}$ are large compared 
to the other coefficients. Haveliwala uses a Bayesian classifier to compute the $\beta_i$’s for 
his experiments, but there are other options. When all this is done, the topic-sensitive 
popularity score is combined with the traditional content score from Chapter 1. Of course, 
if a finer gradation of personalization is desired, more than 16 topics can be used to better 
bias the rankings toward the user’s query and interests. 
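
The query-time combination step can be sketched as follows. This is a toy Python illustration, not Haveliwala's implementation: the function name `combine_topic_vectors`, the three topic vectors over four pages, and the $\beta$ weights are all made-up values chosen to keep the arithmetic visible.

```python
def combine_topic_vectors(topic_vectors, betas):
    """Return sum_i beta_i * pi_i for precomputed topic-biased PageRank vectors."""
    if len(topic_vectors) != len(betas):
        raise ValueError("need exactly one beta per topic vector")
    if any(b < 0 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("betas must be nonnegative and sum to 1")
    n = len(topic_vectors[0])
    return [sum(b * vec[j] for b, vec in zip(betas, topic_vectors))
            for j in range(n)]


# Hypothetical precomputed vectors: three topics, four pages.
topic_vectors = [
    [0.70, 0.10, 0.10, 0.10],   # biased toward page 1
    [0.10, 0.60, 0.20, 0.10],   # biased toward page 2
    [0.05, 0.15, 0.30, 0.50],   # biased toward page 4
]
betas = [0.2, 0.0, 0.8]         # hypothetical classifier output for one query

combined = combine_topic_vectors(topic_vectors, betas)
```

Because the expensive vectors are computed offline, the only work left at query time is this weighted sum, which is why the scheme is practical where a per-user power method is not.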


It seems this little personalization vector $\mathbf{v}^T$ has potentially more significant side 
effects. Some speculate that Google can use this personalization vector to control spamming 
done by the so-called link farms. See the aside, SearchKing vs. Google, on page 52. 


Kaltix’s Personalized Web Search 


It didn’t take Google long to recognize the value of personalized search. In 
fact, Google snatched up Kaltix, a personalized search startup, just three months 
after its inception. Kaltix technology was created by Glen Jeh, Sepandar 
Kamvar, and Taher Haveliwala in the summer of 2003, while the three were 
on leaves of absence from the Stanford Computer Science Department. The 
Kaltix guys worked 20 hours a day that summer, literally working their fin- 
gers to the bone, falling asleep some nights with ice packs on their over- 
worked wrists. The hard work paid off. Google bought Kaltix in Septem- 
ber 2003, and the three moved into the Google headquarters to continue the 
project. In March 2004, Google Labs released Personalized Search in beta 
version (http://labs.google.com/personalized). A user creates a profile by 
setting check boxes in a hierarchical listing of categories of interest. A person- 
alization vector is created from this profile. Then when a query is entered into 
the Personalized Search box, the results are presented in the standard ranked list. 
However, in addition, a slider bar allows one to turn up the level of customization 
and increase the effect of the personalization vector. 


Matlab m-file for Personalized PageRank Power Method 


The Matlab implementation of the PageRank power method on page 42 used a 
uniform personalization vector $\mathbf{v}^T = \mathbf{e}^T/n$. This m-file, which is a simple 
one-line change in that code, implements a more general PageRank power method, 
allowing the personalization vector to vary as input. Therefore, the m-file below 
implements the PageRank power method applied to $\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T$. 
function [pi,time,numiter]=PageRank(pi0,H,v,n,alpha,epsilon);

% PageRank computes the PageRank vector for an n-by-n Markov
%          matrix H with starting vector pi0 (a row vector),
%          scaling parameter alpha (scalar), and teleportation
%          vector v (a row vector).  Uses power method.
%
% EXAMPLE: [pi,time,numiter]=PageRank(pi0,H,v,900,.9,1e-8);
%
% INPUT:   pi0 = starting vector at iteration 0 (a row vector)
%          H = row-normalized hyperlink matrix (n-by-n sparse matrix)
%          v = teleportation vector (1-by-n row vector)
%          n = size of H matrix (scalar)
%          alpha = scaling parameter in PageRank model (scalar)
%          epsilon = convergence tolerance (scalar, e.g. 1e-8)
%
% OUTPUT:  pi = PageRank vector
%          time = time required to compute PageRank vector
%          numiter = number of iterations until convergence
%
% The starting vector is usually set to the uniform vector,
% pi0=1/n*ones(1,n).
% NOTE: Matlab stores sparse matrices by columns, so it is faster
%       to do some operations on H', the transpose of H.

% get "a" vector, where a(i)=1, if row i is dangling node
%                        and 0, o.w.
rowsumvector=ones(1,n)*H';
nonzerorows=find(rowsumvector);
zerorows=setdiff(1:n,nonzerorows); l=length(zerorows);
a=sparse(zerorows,ones(l,1),ones(l,1),n,1);

k=0;
residual=1;
pi=pi0;
tic;

% iterate equation (5.3.1) until successive iterates agree in the 1-norm
while (residual >= epsilon)
  prevpi=pi;
  k=k+1;
  pi=alpha*pi*H + (alpha*(pi*a)+1-alpha)*v;
  residual=norm(pi-prevpi,1);
end
numiter=k;
time=toc;


ASIDE:  SearchKing vs. Google 


Link farms are set up by spammers to fool information retrieval systems into increasing 
the rank of their clients’ pages. One client who made it onto the front page of the Wall 
Street Journal is Joy Holman, the owner of an online store (exoticleatherwear.com) 
that sells provocative leather clothing [160]. Using metatags and HTML coding, she was 
able to attract a modest number of surfers to her store. However, an email from the search 
engine optimization company, AutomatedLinks, convinced Holman to pay the $22 annual fee 
to use their rank-boosting service. (See the aside on search engine optimization on page 43.) 
AutomatedLinks’ sole efforts are aimed at increasing the PageRank (and ranking among other 
search engines) of their clients’ pages. AutomatedLinks accomplishes this with link farms. 
Knowing that PageRank increases when the number of important inlinks to a client’s page 
increases, optimizers add such links to a client’s page. Link farms have several interconnected 
nodes about important topics and with significant PageRanks. These interconnected nodes 
then link to a client’s page, thus, in essence, sharing some of their PageRank with the client’s 
page. Holman’s $22 investment with AutomatedLinks brought her over 26,000 visitors a 
month and thousands of dollars in revenue. 


Most link farms use a link exchange program or reciprocal linking policy to boost 
the rank of clients’ pages, but there are other scenarios for doing this [28, 29]. Of course, 
link farms are very troublesome for search engines that are concerned with the integrity of 
their rankings. Search engines employ several techniques to sniff out link farms. First, they 
ask surfers to be tattletales and report any suspicious pages. Second, they use algorithms to 
identify tightly connected subgraphs of the Web with a high density of reciprocal links. And 
third, they manually inspect the algorithm’s results to determine whether suspected sites play 
fair or foul. Google discourages link spamming by threatening to ban or drop the ranking of 
suspected sites and their neighbors. 


Google’s devaluing of the PageRank of link farmers created a legal stir during 2002 
and 2003. The search engine optimization company, SearchKing, was running smoothly from 
February 2001 until August 2002, in part because it had a high PageRank, which it then shared 
with its clients. Clients with high PageRank had more traffic, and thus happily paid SearchK- 
ing for its rank-boosting service. However, in the few months after August 2002, Bob Massa, 
president of SearchKing, watched the PageRank estimate reported on his Google Toolbar (see 
the box on page 28) drop from PR8 to PR4, then from PR2 to PR0. Of course, his clients 
were affected as well. They complained and many jumped ship. Furious, Bob Massa took 
action on October 17, 2002, by filing a suit against Google with the U.S. District Court for 
the Western District of Oklahoma. SearchKing’s legal team sued Google, demanding $75,000 
in lost revenue plus court fees, the restoration of its and its clients’ previous PageRanks, and 
the disclosure of the source code for the PageRank algorithm used by Google from August 
to October 2002. Both parties knew the import of the case. Its outcome would set a prece- 
dent for the relationship between optimization companies and search engines. SearchKing 
pushed for an early response; Google delayed. By December 30, 2002, Google had prepared 
a powerful, convincing, and well-researched response to SearchKing’s motion for a prelimi- 
nary injunction, and further added a motion to dismiss. There were two main arguments to 
Google’s response. First, Google argued that PageRanks are opinions, the company’s judg- 
ment of the value of webpages. These opinions are protected by the First Amendment. In fact, 
the Google defense team cited a precedent for a similar ranking, the rankings created by credit 
agencies. In 1999 in Jefferson County School District # R-1 v. Moody’s Investors Service, 
Inc., the same court ruled that Moody’s low credit ranking of the school district, while 
possibly harming the district’s perceived housing and schooling value in the public’s eye, was just 
an opinion and was protected by the First Amendment. Similarly, the Google defense team 
argued: 


The PageRank values assigned by Google are not susceptible to being proved true 
or false by objective evidence. How could SearchKing ever “prove” that its ranking 
should “truly” be a 4 or a 6 or a 8? Certainly, SearchKing is not suggesting that 
each one of the billions of web pages ranked by Google are subject to another “truer” 
evaluation? If it believes so, it is certainly free to develop its own search services using 
the criteria it deems most appropriate. 
Google also mentioned that its crawlers index just a small part of the Web, and therefore, 
Google is not obligated to index SearchKing’s page in the first place, much less rank it. 


The second part of Google’s argument concerns the “irreparable harm” that could be 
done by SearchKing’s demand to see the PageRank source code. A motion for preliminary 
injunction can be granted if the plaintiff shows (among other required things) that not doing 
so causes irreparable harm to the plaintiff as SearchKing claimed, due to its loss of clients. 
On the other hand, a motion for a preliminary injunction cannot be granted if it causes the de- 
fendant irreparable harm. Google presented the affidavit of Matthew Cutts, Google’s software 
engineer who works on the PageRank quality team. Regarding the irreparable harm issue, 
Cutts stated: 


Google’s source code for its internally developed software is kept confidential by 
Google and has great value to the ongoing business of Google. Indeed, the technology 
that it encodes constitutes one of Google’s most valuable assets. . . . If an entity 
were in possession of Google’s proprietary source code and wanted to manipulate or 
to abuse Google’s guidelines or relevance, Google could suffer irreparable damage as 
the integrity of, and the public’s confidence in, Google’s quality and scoring would be 
seriously jeopardized. 


Thus, Google made a much stronger case for possible irreparable harm. 


On May 27, 2003, the Court denied SearchKing’s motion for a preliminary injunc- 
tion, and instead, granted Google’s motion to dismiss. The court concluded that Google’s 
PageRanks are entitled to full constitutional protection. Google won this one of their many le- 
gal battles of late. (See the aside on censorship and privacy on page 147.) Those of us engaging 
in hard-fought ethical search engine optimization rejoiced that justice was served. Unethical 
rank-boosting reminds us of a similar unfair practice from our elementary school days—line- 
butting at the water fountain. Nonbutters dislike both the butters (SearchKing clients) and the 
enabler (Bob Massa). Nonbutters feel safer when a teacher is watching. Rest assured that 
Google and other search engines are watching as often as they can. However, some netizens 
argue that the Oklahoma court ruling only plays into the disturbing and growing Googleopoly 
(see the aside on page 112). 


It is not clear exactly how Google devalued SearchKing’s PageRank, whether 
algorithmically or in an ad hoc, ex post facto way. One way to incorporate such devaluation 
algorithmically into the PageRank model is through the personalization vector $\mathbf{v}^T$. The elements in 
$\mathbf{v}^T > 0$ corresponding to suspected or known link farming pages can be set to a very small 
number, close to 0. As the iterative PageRank algorithm proceeds, such pages will be 
devalued slightly, as the surfer will be less likely to teleport there. Of course, the simpler way to 
devalue spammers’ pages is to assign them PR0 after the PageRank calculation is completed. 
The much harder part of the spam problem is the identification of spam pages. 
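
As a purely speculative illustration of the personalization-vector idea (nothing here describes what Google actually does), the following Python sketch builds a teleportation vector in which suspected pages receive almost no teleportation probability. The function name `penalized_personalization`, the value of `tiny`, and the suspect set are all assumptions.

```python
def penalized_personalization(n, suspects, tiny=1e-6):
    """Uniform teleportation vector with suspected pages pushed close to 0.

    Every entry stays strictly positive, so v^T > 0 still holds and the
    Google matrix remains primitive.
    """
    v = [tiny if i in suspects else 1.0 for i in range(n)]
    total = sum(v)
    return [x / total for x in v]


# Hypothetical six-page web in which pages with indices 3 and 4 are suspected.
v = penalized_personalization(6, suspects={3, 4})
```

The resulting vector can be passed as the personalization vector to any PageRank power method; identifying which indices belong in `suspects` remains the hard part.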


ASIDE: Google Bombs 


Friday, April 6, 2001, G-Day: Adam Mathes, then a computer science major at Stanford, 
launches the first Google bomb operation. Adam uses his Filler Friday web article to encour- 
age his readers to help deploy the first ever international Google bomb. Readers are instructed 
to make a hyperlink to the homepage of Adam’s friend, Andy Pressman. Adam reported that 
the anchor text of the hyperlink was the key to the Google bomb. Adam’s readers were in- 
structed to make “talentless hack” the anchor text of their new hyperlink, which pointed to 
Andy Pressman’s page. Adam Mathes had cleverly discovered a loophole in Google’s use 
of anchor text. Given enough links to Andy Pressman’s page with anchor text describing 
that page as “talentless hack,” Google assumes that page really is about talentless hack, even 
though the words may never once appear on Andy’s page. Google added Andy’s page to its 
index under the terms “talentless” and “hack.” From the beginning, Google rightfully noticed 
the descriptive power of anchor text. In fact, anchor text is useful in synonym association. If 
several pages point to a page about autos, but use the term car, the auto page should also be 
indexed under car. Of course, Google bombs have a slow-deploying mechanism: it takes an 
accumulation of links with descriptive anchor text, and thus, time until the content of those 
pages is updated in Google’s index before the bomb explodes. 


Monday, October 27, 2003: George Johnston uses his blog (which is an interactive online di- 
ary) to set off the most famous Google bomb, the “miserable failure” bomb aimed at President 
George W. Bush. Johnston reported that his mission as bomb detonator had been accom- 
plished by late November 2003. In December, entering the query “miserable failure” into 
Google showed the official White House Biography of the President as the number 1 result. 
One reporter noticed that of the over 800 links pointing to the Bush biography, only 32 used 
the phrase “miserable failure” in the anchor text, which meant Google bombing was not only 
fun, it was easy. By January 2004, bombers using the phrase “miserable failure” had to com- 
pete; results showed Michael Moore, President Bush, Jimmy Carter, and Hillary Clinton in 
the top four positions. And of course, other phrases were used by pranksters such as “French 
military victories,” which brought up a Typo Correction page asking “did you mean: French 
military defeats,” and “weapons of mass destruction,” which showed an error page similar to 
the “404 Page Not Found” error page. 


Google’s Reaction: Google initially took a disinterested stance toward Google bombs, claim- 
ing that such games only affected obscure, goofy queries and not their typical serious queries. 
Besides, they claimed that their rankings reflected accurate opinions on the Web; obviously, 
many webpage authors agreed with Johnston that Bush really was a miserable failure. But 
with their June 21, 2004 IPO filing, Google mentioned that the war with spammers, including 
these bombers, is “ongoing and increasing,” and that they were stepping up tactics to outsmart 
the spammers and defuse the bombs. 
Chapter Six 
The Sensitivity of PageRank 


Psychologists say that a person’s sensitivities give insights into the personality. They say 
sensitivity to name-calling might indicate a maligned childhood. Sensitivity to injury, a 
pampered, spoiled upbringing; a short fuse with the boss, anger toward parents, and so on. 
It seems the same is true for the PageRank model. The sensitivities of the PageRank model 
reveal quite a bit about the popularity scores it produces. For example, when $\alpha$ gets very 
close to 1 (its upper bound), it seems to really get PageRank’s goat. In this chapter, we 
explain exactly how PageRank reacts to changes like this. 


In fact, the sensitivity of the PageRank vector can be analyzed by examining each 
parameter of the Google matrix $\mathbf{G}$ separately. In Chapter 5, we emphasized $\mathbf{G}$’s 
dependence on three specific parameters: the scaling parameter $\alpha$, the hyperlink matrix $\mathbf{H}$, and 
the personalization vector $\mathbf{v}^T$. We discuss the effect of each of these on the PageRank 
vector in turn in this chapter. 


6.1 SENSITIVITY WITH RESPECT TO $\alpha$ 


In this section, we use the derivative to show the effect of changes in $\alpha$ on $\boldsymbol{\pi}^T$. The 
derivative is a classical tool for answering questions of sensitivity. The derivative of $\boldsymbol{\pi}^T$ 
with respect to $\alpha$, written $d\boldsymbol{\pi}^T(\alpha)/d\alpha$, tells how much the elements in the PageRank vector 
$\boldsymbol{\pi}^T$ vary when $\alpha$ varies slightly. If element $j$ of $d\boldsymbol{\pi}^T(\alpha)/d\alpha$, denoted $d\pi_j(\alpha)/d\alpha$, is large 
in magnitude, then we can conclude that as $\alpha$ increases slightly, $\pi_j$ (the PageRank for page 
$P_j$) is very sensitive to small changes in $\alpha$. The signs of the derivatives also give important 
information; if $d\pi_j(\alpha)/d\alpha > 0$, then small increases in $\alpha$ imply that the PageRank for 
$P_j$ will increase. And similarly, $d\pi_j(\alpha)/d\alpha < 0$ implies the PageRank decreases. It is 
important to remember that $d\boldsymbol{\pi}^T(\alpha)/d\alpha$ is only an approximation of how elements in $\boldsymbol{\pi}^T$ 
change when $\alpha$ changes, and does not describe exactly how they change. Nevertheless, 
analyzing this derivative can reveal important information about how changes in $\alpha$ affect 
$\boldsymbol{\pi}^T$. 


Even though the parameter $\alpha$ is usually set to .85, it can theoretically take any value in 
the interval $0 < \alpha < 1$. Of course, $\mathbf{G}$ depends on $\alpha$, and so, $\mathbf{G}(\alpha) = \alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T$. The question 
about how sensitive $\boldsymbol{\pi}^T(\alpha)$ is to changes in $\alpha$ can be answered precisely if the derivative 
$d\boldsymbol{\pi}^T(\alpha)/d\alpha$, which gives the rate of change of $\boldsymbol{\pi}^T(\alpha)$ with respect to small changes in 
$\alpha$, can be evaluated. But before attempting to differentiate we should be sure that this 
derivative is well defined. The distribution $\boldsymbol{\pi}^T(\alpha)$ is a left-hand eigenvector for $\mathbf{G}(\alpha)$, but 
eigenvector components need not be differentiable (or even continuous) functions of the 
entries of $\mathbf{G}(\alpha)$ [127, p. 497], so the existence of $d\boldsymbol{\pi}^T(\alpha)/d\alpha$ is not a slam dunk. The 
following theorem provides what is needed. (We have postponed all proofs in this chapter 
until the last section, section 6.5.) 

Theorem 6.1.1. The PageRank vector is given by 

$$
\boldsymbol{\pi}^T(\alpha) = \frac{1}{\sum_{i=1}^{n} D_i(\alpha)}\bigl(D_1(\alpha),\, D_2(\alpha),\, \ldots,\, D_n(\alpha)\bigr),
$$

where $D_i(\alpha)$ is the $i$th principal minor determinant of order $n-1$ in $\mathbf{I} - \mathbf{G}(\alpha)$. Because 
each principal minor $D_i(\alpha) > 0$ is just a sum of products of numbers from $\mathbf{I} - \mathbf{G}(\alpha)$, 
it follows that each component in $\boldsymbol{\pi}^T(\alpha)$ is a differentiable function of $\alpha$ on the interval 
(0, 1). 
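
A small numerical check can make the formula of Theorem 6.1.1 less abstract. The Python sketch below is an illustration under assumptions, not part of the theorem: the 3-by-3 stochastic matrix `S`, the uniform `v`, and $\alpha = .85$ are made up, and the cofactor-expansion determinant is only sensible for tiny matrices.

```python
def det(M):
    """Determinant by cofactor expansion (fine for the tiny matrices used here)."""
    if len(M) == 1:
        return M[0][0]
    total = 0.0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def pagerank_from_minors(G):
    """Normalize the principal minors of order n-1 of I - G, as in Theorem 6.1.1."""
    n = len(G)
    A = [[(1.0 if i == j else 0.0) - G[i][j] for j in range(n)] for i in range(n)]
    D = []
    for i in range(n):
        # Delete row i and column i of I - G, then take the determinant.
        sub = [[A[r][c] for c in range(n) if c != i] for r in range(n) if r != i]
        D.append(det(sub))
    total = sum(D)
    return [d / total for d in D]


alpha = 0.85
S = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # hypothetical stochastic matrix
v = [1 / 3, 1 / 3, 1 / 3]
G = [[alpha * S[i][j] + (1 - alpha) * v[j] for j in range(3)] for i in range(3)]

pi = pagerank_from_minors(G)
# Largest violation of the stationarity condition pi^T G = pi^T.
residual = max(abs(sum(pi[i] * G[i][j] for i in range(3)) - pi[j]) for j in range(3))
```

The residual is at the level of rounding error, consistent with the vector of normalized minors being the stationary vector of $\mathbf{G}(\alpha)$. The formula is valuable for the differentiability argument, not as a practical way to compute PageRank.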


The theorem below provides an upper bound on the individual components of the 
derivative vector as well as an upper bound on the sum of the magnitudes of these individual 
components, denoted by the 1-norm. 


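Theorem 6.1.2. If $\boldsymbol{\pi}^T(\alpha) = (\pi_1(\alpha), \pi_2(\alpha), \ldots, \pi_n(\alpha))$ is the PageRank vector, then 

$$
\left|\frac{d\pi_j(\alpha)}{d\alpha}\right| \le \frac{1}{1-\alpha} \quad \text{for each } j = 1, 2, \ldots, n, \qquad (6.1.1)
$$

and 

$$
\left\|\frac{d\boldsymbol{\pi}^T(\alpha)}{d\alpha}\right\|_1 \le \frac{2}{1-\alpha}. \qquad (6.1.2)
$$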


The utility of Theorem 6.1.2 is limited to smaller values of $\alpha$. For smaller values 
of $\alpha$, Theorem 6.1.2 ensures that PageRanks are not overly sensitive as a function of the 
Google parameter $\alpha$. However, as $\alpha \to 1$, the upper bound $1/(1-\alpha)$ in (6.1.1) tends to $\infty$. Thus, 
the bound becomes increasingly useless because there is no guarantee that it is attainable. 
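
The bounds of Theorem 6.1.2 can be probed numerically with a finite-difference estimate of the derivative. The Python sketch below is illustrative only: the small stochastic matrix `S`, the uniform `v`, the step `h`, and the three test values of $\alpha$ are assumptions, and a central difference merely approximates the true derivative.

```python
def pagerank(S, v, alpha, iters=3000):
    """Iterate pi^T <- alpha*pi^T*S + (1 - alpha)*v^T, starting from v."""
    n = len(S)
    pi = list(v)
    for _ in range(iters):
        pi = [alpha * sum(pi[i] * S[i][j] for i in range(n)) + (1 - alpha) * v[j]
              for j in range(n)]
    return pi


S = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # hypothetical stochastic matrix
v = [1 / 3, 1 / 3, 1 / 3]
h = 1e-6                       # finite-difference step

checks = []
for alpha in (0.5, 0.85, 0.95):
    upper = pagerank(S, v, alpha + h)
    lower = pagerank(S, v, alpha - h)
    deriv = [(a - b) / (2 * h) for a, b in zip(upper, lower)]
    componentwise_ok = max(abs(d) for d in deriv) <= 1 / (1 - alpha)   # (6.1.1)
    one_norm_ok = sum(abs(d) for d in deriv) <= 2 / (1 - alpha)        # (6.1.2)
    checks.append((alpha, componentwise_ok, one_norm_ok))
```

For a chain this small and well mixed, the estimated derivatives sit far below the bounds, which illustrates the point made above: the bounds are guarantees, not predictions of the actual sensitivity.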


But the larger values of $\alpha$ are the ones of most interest because they give more weight 
to the true link structure of the Web while smaller values of $\alpha$ increase the influence of 
the artificial probability vector $\mathbf{v}^T$. Since the PageRank concept is predicated on taking 
advantage of the Web’s link structure, it is natural to choose $\alpha$ closer to 1. Again, it has been 
reported that Google uses $\alpha \approx .85$ [39, 40]. Therefore, more analysis is needed to decide 
on the degree of sensitivity of PageRank to larger values of $\alpha$. The following theorem 
provides a clear and more complete understanding. 


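Theorem 6.1.3. If $\boldsymbol{\pi}^T(\alpha)$ is the PageRank vector associated with the Google matrix 
$\mathbf{G}(\alpha) = \alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T$, then 

$$
\frac{d\boldsymbol{\pi}^T(\alpha)}{d\alpha} = -\mathbf{v}^T(\mathbf{I}-\mathbf{S})(\mathbf{I}-\alpha\mathbf{S})^{-2}. \qquad (6.1.3)
$$

In particular, the limiting values of this derivative are 

$$
\lim_{\alpha \to 0}\frac{d\boldsymbol{\pi}^T(\alpha)}{d\alpha} = -\mathbf{v}^T(\mathbf{I}-\mathbf{S})
\quad \text{and} \quad
\lim_{\alpha \to 1}\frac{d\boldsymbol{\pi}^T(\alpha)}{d\alpha} = -\mathbf{v}^T(\mathbf{I}-\mathbf{S})^{\#},
$$

where $(\star)^{\#}$ denotes the group inverse [46, 122]. 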
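
The proof of Theorem 6.1.3 is deferred to section 6.5. As an informal plausibility check only, the following sketch shows one way formula (6.1.3) can be recovered. It assumes the closed form $\boldsymbol{\pi}^T(\alpha) = (1-\alpha)\mathbf{v}^T(\mathbf{I}-\alpha\mathbf{S})^{-1}$, which is not stated above but follows from $\boldsymbol{\pi}^T = \boldsymbol{\pi}^T\mathbf{G}(\alpha)$ together with $\boldsymbol{\pi}^T\mathbf{e} = 1$, and it uses the fact that $\mathbf{S}$ commutes with $(\mathbf{I}-\alpha\mathbf{S})^{-1}$.

```latex
\begin{aligned}
\boldsymbol{\pi}^T(\alpha)
  &= (1-\alpha)\,\mathbf{v}^T(\mathbf{I}-\alpha\mathbf{S})^{-1}, \\
\frac{d\boldsymbol{\pi}^T(\alpha)}{d\alpha}
  &= -\mathbf{v}^T(\mathbf{I}-\alpha\mathbf{S})^{-1}
     + (1-\alpha)\,\mathbf{v}^T(\mathbf{I}-\alpha\mathbf{S})^{-1}\mathbf{S}\,(\mathbf{I}-\alpha\mathbf{S})^{-1} \\
  &= \mathbf{v}^T\bigl[-(\mathbf{I}-\alpha\mathbf{S}) + (1-\alpha)\mathbf{S}\bigr](\mathbf{I}-\alpha\mathbf{S})^{-2} \\
  &= -\mathbf{v}^T(\mathbf{I}-\mathbf{S})(\mathbf{I}-\alpha\mathbf{S})^{-2}.
\end{aligned}
```

Setting $\alpha = 0$ in the last line gives $-\mathbf{v}^T(\mathbf{I}-\mathbf{S})$, in agreement with the first limiting value in the theorem.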


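The dominant eigenvalue $\lambda_1 = 1$ of all stochastic matrices is semisimple [127, p. 
696], so, when $\mathbf{S}$ is reduced to Jordan form by a similarity transformation, the result is 

$$
\mathbf{J} = \mathbf{X}^{-1}\mathbf{S}\mathbf{X} = \begin{pmatrix} 1 & \mathbf{0} \\ \mathbf{0} & \mathbf{C} \end{pmatrix},
\quad 1 \notin \sigma(\mathbf{C}),
\;\Longrightarrow\;
(\mathbf{I}-\mathbf{S}) = \mathbf{X}\begin{pmatrix} 0 & \mathbf{0} \\ \mathbf{0} & \mathbf{I}-\mathbf{C} \end{pmatrix}\mathbf{X}^{-1}
\;\Longrightarrow\;
(\mathbf{I}-\mathbf{S})^{\#} = \mathbf{X}\begin{pmatrix} 0 & \mathbf{0} \\ \mathbf{0} & (\mathbf{I}-\mathbf{C})^{-1} \end{pmatrix}\mathbf{X}^{-1}.
$$

Matrix $\mathbf{C}$ is composed of Jordan blocks $\mathbf{J}_\star$ associated with eigenvalues $\lambda_k \neq 1$, and the 
corresponding blocks in $(\mathbf{I}-\mathbf{C})^{-1}$ are $(\mathbf{I}-\mathbf{J}_\star)^{-1}$. Combining this with Theorem 6.1.3 
makes it clear that the sensitivity of $\boldsymbol{\pi}^T(\alpha)$ as $\alpha \to 1$ is governed by the size of the entries 
of $(\mathbf{I}-\mathbf{S})^{\#}$. Furthermore, $\|(\mathbf{I}-\mathbf{S})^{\#}\| \le \kappa(\mathbf{X})\,\|(\mathbf{I}-\mathbf{C})^{-1}\|$, where $\kappa(\mathbf{X})$ is the condition number 
of $\mathbf{X}$. Therefore, the sensitivity of $\boldsymbol{\pi}^T(\alpha)$ as $\alpha \to 1$ is governed primarily by the size of 
$\|(\mathbf{I}-\mathbf{C})^{-1}\|$, which is driven by the size of $|1-\lambda_2|^{-1}$ (along with the index of $\lambda_2$), where 
$\lambda_2 \neq 1$ is the eigenvalue of $\mathbf{S}$ that is closest to $\lambda_1 = 1$. In other words, the closer $\lambda_2$ is to 
$\lambda_1 = 1$, the more sensitive $\boldsymbol{\pi}^T(\alpha)$ is when $\alpha$ is close to 1. 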


Generally speaking, stochastic matrices having a subdominant eigenvalue near to 1 
are those that represent nearly uncoupled chains [85] (also known as nearly completely 
decomposable chains). These are chains whose states form clusters such that the states 
within each cluster are strongly linked to each other, but the clusters themselves are only 
weakly linked, i.e., the states can be ordered so that the transition probability matrix has 
the form $\mathbf{S} = \mathbf{D} + \epsilon\mathbf{E}$, where $\mathbf{D}$ is block diagonal, $\|\mathbf{E}\| < 1$, and $0 < \epsilon < 1$ is small 
relative to 1. 
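
The link between a subdominant eigenvalue near 1 and sensitivity can be seen in the smallest possible setting. The Python sketch below uses a made-up two-state chain in which each state stands in for a whole cluster and `eps` is the probability of crossing between them, so the eigenvalues of `S` are 1 and `1 - 2*eps`. The personalization vector, the value $\alpha = .99$, and the finite-difference step are illustrative assumptions.

```python
def pagerank(S, v, alpha, iters=20000):
    """Iterate pi^T <- alpha*pi^T*S + (1 - alpha)*v^T, starting from v."""
    n = len(S)
    pi = list(v)
    for _ in range(iters):
        pi = [alpha * sum(pi[i] * S[i][j] for i in range(n)) + (1 - alpha) * v[j]
              for j in range(n)]
    return pi


def sensitivity(eps, alpha, v=(0.9, 0.1), h=1e-6):
    """Central-difference estimate of |d pi_1 / d alpha| for the two-state chain."""
    S = [[1 - eps, eps],
         [eps, 1 - eps]]       # eigenvalues 1 and 1 - 2*eps
    up = pagerank(S, v, alpha + h)
    down = pagerank(S, v, alpha - h)
    return abs(up[0] - down[0]) / (2 * h)


alpha = 0.99
strongly_coupled = sensitivity(0.3, alpha)      # lambda_2 = 0.4
nearly_uncoupled = sensitivity(0.001, alpha)    # lambda_2 = 0.998
```

With these values the nearly uncoupled chain is several times more sensitive to $\alpha$ than the strongly coupled one, and the gap widens as $\alpha$ moves closer to 1.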


The chain defined by the link structure of the Web is almost certainly nearly 
uncoupled (weakly linked clusters of closely coupled nodes abound due to specialized interests, 
regional interests, geographical considerations, etc.), so the matrix $\mathbf{S}$ can be expected to 
have a subdominant eigenvalue very close to $\lambda_1 = 1$. Therefore, as $\alpha$ grows, the PageRank 
vector becomes increasingly sensitive to changes in $\alpha$, and when $\alpha \approx 1$, PageRank is 
extremely sensitive. Putting all of these observations together produces the following 
conclusions. 


Summary of PageRank Sensitivity 

As a function of the parameter $\alpha$, the sensitivity of the PageRank vector $\boldsymbol{\pi}^T(\alpha)$ 
to small changes in $\alpha$ is as follows. 

• For small $\alpha$, PageRank is insensitive to slight variations in $\alpha$. 

• As $\alpha$ becomes larger, PageRank becomes increasingly more sensitive to small 
perturbations in $\alpha$. 

• For $\alpha$ close to 1, PageRank is very sensitive to small changes in $\alpha$. The degree 
of sensitivity is governed by the degree to which $\mathbf{S}$ is nearly uncoupled. 

The Balancing Act 


Larger values of $\alpha$ give more weight to the true link structure of the Web while 
smaller values of $\alpha$ increase the influence of the artificial probability vector $\mathbf{v}^T$. 
Because the PageRank concept is predicated on trying to take advantage of the 
Web’s link structure, it’s more desirable to choose $\alpha$ close to 1. But this is where 
PageRank becomes most sensitive, so moderation is necessary; it has been 
reported that Google uses $\alpha \approx .85$ [39, 40]. 

We close this section with a numerical example whereby we examine the eigenvalues 
and PageRank vectors of the matrices associated with two related web graphs. 


EXAMPLE 1 A small web graph is pictured in Figure 6.1. 


Figure 6.1 Directed graph for web of seven pages 


Table 6.1 shows the eigenvalues (sorted by magnitude) for the three matrices associ- 
ated with this graph: the raw hyperlink matrix H, the stochastic matrix S, and the Google 
matrix G. It also shows the PageRank values and rank for different values of a. 


Table 6.1 Eigenvalues and PageRank vector for 7-node graph of Figure 6.1 

                        |         α = .8           |         α = .9           |         α = .99
σ(H)        σ(S)        | σ(G)        π_i    Rank  | σ(G)        π_i    Rank  | σ(G)        π_i    Rank
1           1           | 1           .0641  6     | 1           .0404  6     | 1           .0054  6
-.50+.87i   -.50+.87i   | -.40+.69i   .0871  5     | -.45+.78i   .0558  5     | -.50+.86i   .0075  5
-.50-.87i   -.50-.87i   | -.40-.69i   .1056  4     | -.45-.78i   .0697  4     | -.50-.86i   .0096  4
-.35+.60i   .7991       | .6393       .2372  1     | .7192       .2720  1     | .7911       .3253  1
-.35-.60i   -.33+.61i   | -.26+.49i   .2256  2     | -.30+.55i   .2643  2     | -.33+.60i   .3240  2
.6934       -.33-.61i   | -.26-.49i   .2164  3     | -.30-.55i   .2573  3     | -.33-.60i   .3231  3
0           0           | 0           .0641  6     | 0           .0404  6     | 0           .0054  6


According to PageRank, the pages are ordered from most important to least 
important as (4 5 6 3 2 1 7). Table 6.1 reveals several facts. First, $|\lambda_2(\mathbf{G})| = \alpha$ since 
$\mathbf{S}$ has several eigenvalues on the unit circle, a consequence of the reducibility and 
periodicity of the graph. Second, as $\alpha \to 1$, the PageRank values do change noticeably; however, 
in this example, the actual ranks do not change. Other experiments on larger graphs show 
that the ranks can also change as $\alpha \to 1$ [158]. (We discuss the issue of sensitivity of 
PageRank values versus PageRank ranks later, in Section 6.4.) Third, the second largest in 
magnitude eigenvalue of $\mathbf{S}$ is .7991. In this section, we emphasized that this value (which 
also measures the degree of coupling of the Markov chain) governs the sensitivity of the 
PageRank vector. Since .7991 is not close to 1, we expect this chain to be rather insensitive 
to small changes. Let’s check this hypothesis. We perturb the chain by adding one 
hyperlink from page 6 to page 5. Thus, row 6 of $\mathbf{H}$ changes, so that $H_{64} = H_{65} = .5$. Table 6.2 
shows the changes in the eigenvalues and PageRanks. 


Table 6.2 Eigenvalues and PageRank vector for perturbed 7-node graph of Figure 6.1


                          $\alpha = .8$               $\alpha = .9$               $\alpha = .99$
$\sigma(H)$  $\sigma(S)$  $\sigma(G)$  $\pi^T$  Rank  $\sigma(G)$  $\pi^T$  Rank  $\sigma(G)$  $\pi^T$  Rank
1            1            1            .0641    6     1            .0404    6     1            .0054    6
-.50+.50i    .7991        .6393        .0871    5     .7192        .0558    5     .7911        .0075    5
-.50-.50i    -.50+.50i    -.40+.40i    .1056    4     -.45+.45i    .0697    4     -.50+.50i    .0096    4
.6934        -.50-.50i    -.40-.40i    .1637    3     -.45-.45i    .1765    3     -.50-.50i    .1968    3
-.35+.60i    -.33+.61i    -.26+.49i    .2664    1     -.30+.55i    .3145    1     -.33+.60i    .3885    1
-.35-.60i    -.33-.61i    -.26-.49i    .2491    2     -.30-.55i    .3025    2     -.33-.60i    .3848    2
0            0            0            .0641    6     0            .0404    6     0            .0054    6


After the addition of just one hyperlink, the pages are now ordered from most important
to least important as (5 6 4 3 2 1 7). Comparing this with the original ordering, we see
that page 4 has moved down the ranked list from first place to third place. Comparing the
PageRank values for the original chain with those for the perturbed chain, we see that only
the PageRank values for pages 4, 5, and 6 have changed (again, a consequence of the
reducibility of the chain).
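

The perturbation experiment of Example 1 can be mimicked on a much smaller graph. The
sketch below is illustrative only and does not reproduce the graph of Figure 6.1. It assumes
a hypothetical three-page cycle (page 0 → page 1 → page 2 → page 0), a uniform
personalization vector, and a plain power method; the names `pagerank` and `ordering` and
the tolerance are our own choices. Adding one hyperlink from page 2 to page 1 plays the role
of the new link from page 6 to page 5.

```python
def pagerank(H, alpha, tol=1e-12, max_iter=100000):
    """Power method for pi^T = pi^T (alpha*S + (1 - alpha) e v^T), uniform v.

    H is a row-normalized hyperlink matrix given as a list of rows;
    an all-zero row marks a dangling node (it is replaced by v^T).
    """
    n = len(H)
    v = [1.0 / n] * n
    pi = v[:]
    for _ in range(max_iter):
        dangling = sum(pi[i] for i in range(n) if not any(H[i]))
        new = [alpha * sum(pi[i] * H[i][j] for i in range(n))
               + (alpha * dangling + 1.0 - alpha) * v[j] for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def ordering(pi):
    """Page indices from most important to least important."""
    return sorted(range(len(pi)), key=lambda i: -pi[i])


# Hypothetical 3-cycle: 0 -> 1 -> 2 -> 0.
H = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
# Same graph after adding the hyperlink 2 -> 1 (row 2 becomes .5, .5, 0).
H_perturbed = [[0, 1, 0],
               [0, 0, 1],
               [0.5, 0.5, 0]]

before = pagerank(H, alpha=0.8)
after = pagerank(H_perturbed, alpha=0.8)
print("before:", [round(p, 4) for p in before])
print("after: ", [round(p, 4) for p in after], ordering(after))
```

With $\alpha = .8$ the three pages start out tied. After the extra link, the page that gains
an inlink moves to the top, and the page that loses half of its only inlink drops to the
bottom.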


In Example 2, we consider a related graph in which the second largest in magnitude 
eigenvalue of S is closer to 1. In this case, we expect the PageRank vector to be more 
sensitive to small changes than the PageRank vector for Example 1. 


EXAMPLE 2 In this example, we apply the intelligent surfer model of Section 5.2 rather
than the democratic random surfer model to the same graph from Example 1. Suppose an
intelligent surfer determines new hyperlinking probabilities for page 3. See Figure 6.2.


Figure 6.2 Intelligent surfer’s graph for web of seven pages 


Notice that the intelligent surfer decides to increase the hyperlinking probabilities of
pages inside the cluster of pages 1, 2, 3, and 7, while drastically decreasing the probability
of jumping to the other cluster of pages 4, 5, and 6. As a result, the stochastic matrix $S$ is
much more nearly uncoupled. Of course, we purposely designed this example so that the second
largest in magnitude eigenvalue of $S$ is closer to 1. The change in the degree of coupling is
apparent: $\lambda_2(S) = .9193$ in this example versus .7991 in Example 1. Table 6.3 shows the
eigenvalues and PageRank vectors associated with this new graph.


Table 6.3 Eigenvalues and PageRank vector for intelligent surfer graph of Figure 6.2


                          $\alpha = .8$               $\alpha = .9$               $\alpha = .99$
$\sigma(H)$  $\sigma(S)$  $\sigma(G)$  $\pi^T$  Rank  $\sigma(G)$  $\pi^T$  Rank  $\sigma(G)$  $\pi^T$  Rank
1            1            1            .0736    6     1            .0538    6     1            .0099    6
-.50+.87i    -.50+.87i    -.40+.70i    .1324    5     -.45+.78i    .1022    5     -.50+.86i    .0197    5
-.50-.87i    -.50-.87i    -.40-.70i    .1429    4     -.45-.78i    .1132    4     -.50-.86i    .0224    4
.8378        .9193        .7354        .1943    1     .8274        .2271    1     .9101        .3130    1
-.42+.43i    -.39+.44i    -.31+.36i    .1924    2     -.35+.40i    .2256    2     -.38+.44i    .3127    2
-.42-.43i    -.39-.44i    -.31-.36i    .1909    3     -.35-.40i    .2242    3     -.38-.44i    .3124    3
0            0            0            .0736    6     0            .0538    6     0            .0099    6


Notice that the pages in Figure 6.2 are ordered from most important to least important
as (4 5 6 3 2 1 7). Now let’s make the same perturbation that we did in Example 1 (add a
hyperlink from page 6 to page 5, making $H_{64} = H_{65} = .5$). Table 6.4 shows the effect
on the PageRank vector.


Table 6.4 Eigenvalues and PageRank vector for perturbed intelligent surfer graph of Figure 6.2


                          $\alpha = .8$               $\alpha = .9$               $\alpha = .99$
$\sigma(H)$  $\sigma(S)$  $\sigma(G)$  $\pi^T$  Rank  $\sigma(G)$  $\pi^T$  Rank  $\sigma(G)$  $\pi^T$  Rank
1            1            1            .0736    6     1            .0538    6     1            .0099    6
.8378        .9193        .7354        .1324    4     .8274        .1022    5     .9101        .0197    5
-.50+.50i    -.50+.50i    -.40+.40i    .1429    3     -.45+.45i    .1132    4     -.50+.50i    .0224    4
-.50-.50i    -.50-.50i    -.40-.40i    .1294    5     -.45-.45i    .1439    3     -.50-.50i    .1889    3
-.42+.45i    -.39+.44i    -.31+.36i    .2284    1     -.35+.40i    .2694    1     -.38+.44i    .3750    1
-.42-.45i    -.39-.44i    -.31-.36i    .2197    2     -.35-.40i    .2636    2     -.38-.44i    .3741    2
0            0            0            .0736    6     0            .0538    6     0            .0099    6


After the perturbation, the pages are ordered from most important to least important
as (5 6 3 2 4 1 7). Page 4 slides much farther down the ranked list. Both the rankings and
the PageRank values are more sensitive in Example 2 than in Example 1 to the same small
perturbation. This clearly demonstrates the effect of $\lambda_2(S)$ on the sensitivity
of the PageRank vector.


Very recently, researchers from the University of Southern California have studied
the behavior of PageRank with respect to changes in $\alpha$ in order to detect link spammers
[164]. Their results are promising. Their technique is successful in identifying “colluding”
pages, pages that are in collusion to boost each other’s PageRank through a link farm or link
exchange scheme. They also define a slightly modified PageRank algorithm that decreases
the value of links from the identified colluding pages.


Italian researchers have extended work on the sensitivity of PageRank with respect
to $\alpha$ by examining higher-order derivatives than the simple first-order derivatives of
this section [32].


6.2 SENSITIVITY WITH RESPECT TO H 


The question in this section is: how sensitive is $\pi^T$ to changes in $H$? Traditional
perturbation results [121] say that for a Markov chain with transition matrix $P$ and
stationary vector $\pi^T$,

$$\pi^T \text{ is sensitive to perturbations in } P \iff |\lambda_2(P)| \approx 1.$$

For the PageRank problem, we know that $|\lambda_2(G)| \le \alpha$, and further, for a
reducible $S$, $\lambda_2(G) = \alpha$. Therefore, as $\alpha \to 1$, the PageRank vector
becomes more and more sensitive to changes in $G$, a result from the previous section.
However, $G$ depends on $\alpha$, $H$, and $v^T$, so in this section we would like to isolate
the effect of hyperlink changes (the effect of changes to $H$ on the sensitivity of the
PageRank vector). We can squeeze a little more information about the sensitivity with
respect to hyperlink changes by computing another derivative:

$$\frac{d\pi^T(h_{ij})}{dh_{ij}} = \alpha\,\pi_i\,(e_j^T - v^T)(I - \alpha S)^{-1}. \qquad (6.2.1)$$

The effect of $\alpha$ is clear. As $\alpha \to 1$, the elements of $(I - \alpha S)^{-1}$
approach infinity, and the PageRank vector is more sensitive to small changes in the
structure of the web graph. But another result appears, a rather common sense result: adding
a link or increasing the weight of a link from an important page (i.e., $\pi_i$ is high) has a
greater effect on the sensitivity of the PageRank vector than changing a link from an
unimportant page.


6.3 SENSITIVITY WITH RESPECT TO $v^T$


Lastly, we consider the effect of changes in the personalization vector $v^T$. We begin by
computing the derivative of $\pi^T$ with respect to $v^T$:

$$\frac{d\pi^T(v^T)}{dv^T} = \Bigl(1 - \alpha + \alpha \sum_{i \in D} \pi_i\Bigr)(I - \alpha S)^{-1}, \qquad (6.3.1)$$

where $D$ is the set of dangling nodes.


Equation (6.3.1) gives two insights into the sensitivity of $\pi^T$ with respect to $v^T$.
First, there is the dependence on $\alpha$. As $\alpha \to 1$, the elements of
$(I - \alpha S)^{-1}$ approach infinity. Again, we conclude that as $\alpha \to 1$, $\pi^T$
becomes increasingly sensitive. Nothing new there. However, the second interpretation gives
a bit more information. If the dangling nodes combine to contain a large proportion of the
PageRank (i.e., $\sum_{i \in D} \pi_i$ is large), then the PageRank vector is more sensitive
to changes in the personalization vector $v^T$. This agrees with common sense. If
collectively the set of dangling nodes is important, then the random surfer revisits them
often and thus follows the teleportation probabilities given in $v^T$ more often. Therefore,
the random surfer’s actions, and thus the distribution of PageRanks, are sensitive to changes
in the teleportation vector $v^T$.


A Fundamental Matrix for the PageRank problem 


Because the matrix $(I - \alpha S)^{-1}$ plays a fundamental role in the PageRank problem,
both in the sensitivity analysis of this chapter and the linear system formulation
of the next chapter, we call it the fundamental matrix of the PageRank problem.


6.4 OTHER ANALYSES OF SENSITIVITY 


Three other research groups have examined the sensitivity and stability of the PageRank
vector: Ng et al. at the University of California at Berkeley, Bianchini et al. in Siena,
Italy, and Borodin et al. at the University of Toronto. All three groups have computed some
version of the following bound on the difference between the old PageRank vector $\pi^T$ and
the new, updated PageRank vector $\tilde{\pi}^T$ [29, 113, 133]:

$$\|\pi^T - \tilde{\pi}^T\|_1 \le \frac{2\alpha}{1-\alpha} \sum_{i \in U} \pi_i,$$

where $U$ is the set of all pages that have been updated. (The proof is given in Section 6.5,
p. 69.) This bound gives another sensitivity interpretation: as long as $\alpha$ is not close
to 1 and the updated pages do not have high PageRank, then the updated PageRank values do
not change much. Let’s consider the two factors of the bound, $2\alpha/(1-\alpha)$ and
$\sum_{i \in U} \pi_i$.


As an example, suppose $\alpha = .8$ and the sum of the old PageRanks for all updated
pages, $\sum_{i \in U} \pi_i$, is $10^{-3}$. Then the multiplicative constant
$2\alpha/(1-\alpha) = 8$, which means that the 1-norm of the difference between the old
PageRank vector and the updated PageRank vector, $\|\pi^T - \tilde{\pi}^T\|_1$, is at most
$8 \times 10^{-3}$. Consequently, in this case,
the PageRank values are rather insensitive to the Web’s updates. As $\alpha \to 1$, the bound
becomes increasingly useless. The utility of the bound is governed by how much
$\sum_{i \in U} \pi_i$ can offset the growth of $2\alpha/(1-\alpha)$. Two things affect the
size of $\sum_{i \in U} \pi_i$: the number of updated pages and the PageRanks of those updated
pages. This exposes another limitation of the bound. It provides no help with the more
interesting and natural question of “what happens to PageRank when the high PageRank pages
are updated?” For example, how do changes to a popular, high rank page like the Amazon
webpage affect the rankings? Section 6.2 provided a more complete answer to this question.
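

The bound is easy to test numerically. The following sketch is not taken from any of the
cited papers: it assumes a hypothetical four-page graph with no dangling nodes, a uniform
personalization vector, and our own helper `pagerank` (a plain power method). Only page 3
is updated, so $U = \{3\}$.

```python
def pagerank(H, alpha, tol=1e-13, max_iter=100000):
    """Power method with uniform personalization vector; zero rows are dangling."""
    n = len(H)
    v = [1.0 / n] * n
    pi = v[:]
    for _ in range(max_iter):
        dangling = sum(pi[i] for i in range(n) if not any(H[i]))
        new = [alpha * sum(pi[i] * H[i][j] for i in range(n))
               + (alpha * dangling + 1.0 - alpha) * v[j] for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


# Hypothetical graph: page 3 has no inlinks, so its PageRank is small.
H_old = [[0, 0.5, 0.5, 0],
         [0, 0, 1, 0],
         [1, 0, 0, 0],
         [0.5, 0, 0.5, 0]]
# Page 3 replaces its two outlinks with a single link to page 1.
H_new = [row[:] for row in H_old]
H_new[3] = [0, 1, 0, 0]

alpha = 0.8
updated = [3]                                   # the set U
pi_old = pagerank(H_old, alpha)
pi_new = pagerank(H_new, alpha)

lhs = sum(abs(a - b) for a, b in zip(pi_old, pi_new))
bound = 2 * alpha / (1 - alpha) * sum(pi_old[i] for i in updated)
print(f"||pi - pi_new||_1 = {lhs:.4f}, bound = {bound:.4f}")
```

Because page 3 receives no inlinks, its PageRank is just the teleportation share
$(1-\alpha)/4 = .05$, and the bound evaluates to $8 \times .05 = .4$. Moving the update to a
high-PageRank page, or pushing $\alpha$ toward 1, inflates the bound quickly, which is the
limitation discussed above.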


PageRank and Link Spamming 


The difference between the old PageRank vector $\pi^T$ and the updated PageRank
vector $\tilde{\pi}^T$ can be bounded as follows:

$$\|\pi^T - \tilde{\pi}^T\|_1 \le \frac{2\alpha}{1-\alpha} \sum_{i \in U} \pi_i, \qquad (6.4.1)$$

where $U$ is the set of all pages that have been updated.

• This bound is useful when $\alpha$ is small and the set of updated pages has small
  aggregate PageRank. It implies that as long as $\alpha$ is not close to 1 and the updated
  pages do not have high PageRank, then the updated PageRank values do not
  differ greatly from the original PageRank values.

• On the other hand, the bound does not tell us how sensitive PageRank is to
  changes in popular, high PageRank pages.

• Using the bound of (6.4.1), researchers [29] have made the following statement
  about the effectiveness of link spamming:

      ... a nice property of PageRank [is] that a community can only make a
      very limited change to the overall PageRank of the Web. Thus, regardless
      of the way they change, non-authoritative communities cannot affect
      significantly the global PageRank.

• This bound reinforces the philosophy that the optimizing game is to either get several
  high PageRank pages or many lower PageRank pages to point to your page.
  See [12] for other mathematically optimal linking strategies regarding PageRank.
A fourth group of researchers recently joined the stability discussion. Ronny Lempel
and Shlomo Moran, the inventors of the SALSA algorithm [114] (see Chapter 12), have
added a further distinction to the definition of stability. In [115], they note that the stability
of an algorithm, which concerns volatility of the values assigned to pages, has been well
studied. What has not been studied is the notion of rank-stability (first defined and studied
by Borodin et al. [36]), which addresses how volatile the rankings of pages are with respect
to changes in the underlying graph. As an example, suppose

$$\pi^T = (.198 \;\; .199 \;\; .20 \;\; .201 \;\; .202) \quad \text{and} \quad \tilde{\pi}^T = (.202 \;\; .201 \;\; .20 \;\; .199 \;\; .198).$$

The original and updated PageRank values have not changed much,
$\|\pi^T - \tilde{\pi}^T\|_1 = .012$, and yet the rankings have flipped. Lempel and Moran show
that stability of PageRank values does not imply rank-stability. In fact, they provide a
small example demonstrating that a change in just one outlink of a very low ranking page can
turn the entire ranking upside down! They also introduce the interesting concept of
running-time stability, challenging researchers to examine the effect of small perturbations
in the graph on an algorithm’s running time.
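

The small example above can be checked in a few lines. The sketch below uses only the two
vectors quoted in the text; the helper name `ranking` is our own.

```python
pi_old = [0.198, 0.199, 0.20, 0.201, 0.202]
pi_new = [0.202, 0.201, 0.20, 0.199, 0.198]

# 1-norm of the change in PageRank values.
l1_change = sum(abs(a - b) for a, b in zip(pi_old, pi_new))


def ranking(pi):
    """Pages (numbered from 1) ordered from highest to lowest PageRank."""
    return sorted(range(1, len(pi) + 1), key=lambda p: -pi[p - 1])


print(round(l1_change, 3))   # 0.012
print(ranking(pi_old))       # [5, 4, 3, 2, 1]
print(ranking(pi_new))       # [1, 2, 3, 4, 5]
```

The values move by only .012 in total, yet the ordering is completely reversed, which is
exactly the gap between value stability and rank-stability.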


REMARK: From the start of the book, we’ve emphasized the Web’s dynamics. However, 
while the content of webpages does change very often, we are concerned only with changes 
to the graph structure of the Web. Graph changes affect the PageRank vector, whereas 
content changes affect the inverted index of Chapters 1 and 2. Updates to the Web’s graph
can be of two types: link updates or node updates. The analyses of sensitivity and updating 
in this chapter all assume that the Web’s updates are only link updates, which refers to the 
addition or removal of hyperlinks. Node updates, the addition or removal of webpages, 
are not considered. Analyzing node updates is a much harder problem, which we postpone 
until Chapter 10. 


ASIDE: RankPulse


The website www.rankpulse.com uses Google’s Web Application Programming
Interface (Web API, see the aside on page 97) to monitor the pulse of the top ten rankings for
1,000 queries. Even though exact PageRank values are not available, the RankPulse authors 
have developed a clever workaround. They track only the first page of Google results (the top 
ten list) for a query like “basketball,” noticing how the ten sites jockey for position. Every day 
they record the websites and their positions in the top ten list. Then they plot these rankings 
over time. Figure 6.3 shows a RankPulse chart for “basketball” on July 26, 2004. 


We have included only the plots for five of the ten available websites to reduce the
clutter on the chart. The sites www.nba.com and www.basketball.com are historical
fixtures in the top two spots, while www.basketball.ca bounces in and out of the bottom
part of the top ten. The other two sites, www.fiba.com and www.wnba.com, fluctuate
among the top ten. The fluctuations for the WNBA site can be explained by the league’s
seasonal schedule and frequent updates: big college tournaments in March, predictions for
and the lead up to the April draft, then the May through September season.


Since Google’s overall rankings are a combination of content scores and PageRank
popularity scores, it is hard to isolate the sensitivity of PageRank in the RankPulse charts.
Nevertheless, these charts give some approximation to the sensitivity of Google rankings for
select pages.


Figure 6.3 RankPulse chart for basketball


6.5 SENSITIVITY THEOREMS AND PROOFS


Theorem 6.1.1 The PageRank vector is given by

$$\pi^T(\alpha) = \frac{1}{\sum_{i=1}^{n} D_i(\alpha)} \bigl( D_1(\alpha),\, D_2(\alpha),\, \ldots,\, D_n(\alpha) \bigr),$$

where $D_i(\alpha)$ is the $i$th principal minor determinant of order $n-1$ in $I - G(\alpha)$.
Because each principal minor $D_i(\alpha) > 0$ is just a sum of products of numbers from
$I - G(\alpha)$, it follows that each component in $\pi^T(\alpha)$ is a differentiable
function of $\alpha$ on the interval $(0, 1)$.


Proof. For convenience, let $G = G(\alpha)$, $\pi^T(\alpha) = \pi^T$, $D_i = D_i(\alpha)$, and
set $A = I - G$. If $\operatorname{adj}(A)$ denotes the transpose of the matrix of cofactors
(often called the adjugate or adjoint), then

$$A[\operatorname{adj}(A)] = 0 = [\operatorname{adj}(A)]A.$$

It follows from the Perron-Frobenius theorem that $\operatorname{rank}(A) = n-1$, and as a
result $\operatorname{rank}(\operatorname{adj}(A)) = 1$. Furthermore, Perron-Frobenius ensures
that each column of $[\operatorname{adj}(A)]$ is a multiple of $e$, so
$[\operatorname{adj}(A)] = ew^T$ for some vector $w$. But $[\operatorname{adj}(A)]_{ii} = D_i$,
so $w^T = (D_1, D_2, \ldots, D_n)$. Similarly, $[\operatorname{adj}(A)]A = 0$ ensures that each
row in $[\operatorname{adj}(A)]$ is a multiple of $\pi^T$ and hence $w^T = c\,\pi^T$ for some
scalar $c$. This scalar cannot be zero; otherwise $[\operatorname{adj}(A)] = 0$, which is
impossible. Therefore, $w^Te = c \ne 0$, and $w^T/(w^Te) = w^T/c = \pi^T$. ∎
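

Theorem 6.1.1 can be checked directly on a tiny example. The sketch below is only an
illustration: the $3 \times 3$ stochastic matrix is hypothetical, the personalization vector
is assumed uniform, and `det` and `pagerank_from_minors` are our own helper names. The
cofactor-expansion determinant is adequate only for very small matrices.

```python
def det(M):
    """Determinant by cofactor expansion along the first row (tiny matrices only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0.0
    for j in range(len(M)):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total


def pagerank_from_minors(G):
    """pi_i = D_i / sum_k D_k, where D_i is the i-th principal minor of I - G."""
    n = len(G)
    A = [[(1.0 if i == j else 0.0) - G[i][j] for j in range(n)] for i in range(n)]
    D = []
    for i in range(n):
        # Delete row i and column i of A, then take the determinant.
        sub = [[A[r][c] for c in range(n) if c != i] for r in range(n) if r != i]
        D.append(det(sub))
    total = sum(D)
    return [d / total for d in D]


alpha, n = 0.8, 3
S = [[0, 1, 0],
     [0, 0, 1],
     [0.5, 0.5, 0]]          # hypothetical stochastic matrix
G = [[alpha * S[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

pi = pagerank_from_minors(G)
# Largest violation of the eigenvector equation pi^T G = pi^T.
residual = max(abs(sum(pi[i] * G[i][j] for i in range(n)) - pi[j]) for j in range(n))
print([round(p, 4) for p in pi], residual)
```

For this matrix the three principal minors work out to 105/225, 189/225, and 183/225, so
the normalized vector is (105, 189, 183)/477, and the residual of $\pi^T G = \pi^T$ is at
the level of rounding error.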


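Theorem 6.1.2 If $\pi^T(\alpha) = (\pi_1(\alpha), \pi_2(\alpha), \ldots, \pi_n(\alpha))$ is the
PageRank vector, then

$$\left| \frac{d\pi_j(\alpha)}{d\alpha} \right| \le \frac{1}{1-\alpha} \quad \text{for each } j = 1, 2, \ldots, n, \qquad (6.1.1)$$

and

$$\left\| \frac{d\pi^T(\alpha)}{d\alpha} \right\|_1 \le \frac{2}{1-\alpha}. \qquad (6.1.2)$$


Proof. First compute $d\pi^T(\alpha)/d\alpha$ by noting that $\pi^T(\alpha)e = 1$ implies

$$\frac{d\pi^T(\alpha)}{d\alpha}\, e = 0.$$

Using this while differentiating both sides of

$$\pi^T(\alpha) = \pi^T(\alpha)\bigl(\alpha S + (1-\alpha)ev^T\bigr)$$

yields

$$\frac{d\pi^T(\alpha)}{d\alpha}\,(I - \alpha S) = \pi^T(\alpha)(S - ev^T).$$

Matrix $I - \alpha S$ is nonsingular because $\alpha < 1$ guarantees that
$\rho(\alpha S) < 1$, so

$$\frac{d\pi^T(\alpha)}{d\alpha} = \pi^T(\alpha)(S - ev^T)(I - \alpha S)^{-1}. \qquad (6.5.1)$$

The proof of (6.1.1) hinges on the following inequality. For every real $x \in e^{\perp}$ (the
orthogonal complement of span$\{e\}$), and for all real vectors $y_{n \times 1}$,

$$|x^T y| \le \|x\|_1 \left( \frac{y_{\max} - y_{\min}}{2} \right). \qquad (6.5.2)$$

This is a consequence of Hölder’s inequality because for every real scalar $c$,

$$|x^T y| = |x^T (y - ce)| \le \|x\|_1 \|y - ce\|_\infty,$$

and $\min_c \|y - ce\|_\infty = (y_{\max} - y_{\min})/2$, where the minimum is attained when
$c = (y_{\max} + y_{\min})/2$. It follows from (6.5.1) that

$$\frac{d\pi_j(\alpha)}{d\alpha} = \pi^T(\alpha)(S - ev^T)(I - \alpha S)^{-1} e_j,$$

where $e_j$ is the $j$th standard basis vector (i.e., the $j$th column of $I_{n \times n}$).
Since it’s true that $\pi^T(\alpha)(S - ev^T)e = 0$, inequality (6.5.2) may be applied with
$x^T = \pi^T(\alpha)(S - ev^T)$ and

$$y = (I - \alpha S)^{-1} e_j$$

to obtain

$$\left| \frac{d\pi_j(\alpha)}{d\alpha} \right| \le \|\pi^T(\alpha)(S - ev^T)\|_1 \left( \frac{y_{\max} - y_{\min}}{2} \right).$$

But $\|\pi^T(\alpha)(S - ev^T)\|_1 \le 2$, so

$$\left| \frac{d\pi_j(\alpha)}{d\alpha} \right| \le y_{\max} - y_{\min}.$$

Now use the fact that $(I - \alpha S)^{-1} \ge 0$ together with the observation that

$$(I - \alpha S)e = (1-\alpha)e \implies (I - \alpha S)^{-1}e = (1-\alpha)^{-1}e$$

to conclude that $y_{\min} \ge 0$ and

$$y_{\max} \le \max_{i,j} \bigl[(I - \alpha S)^{-1}\bigr]_{ij} \le \|(I - \alpha S)^{-1}\|_\infty = \|(I - \alpha S)^{-1}e\|_\infty = \frac{1}{1-\alpha}.$$

Consequently,

$$\left| \frac{d\pi_j(\alpha)}{d\alpha} \right| \le \frac{1}{1-\alpha},$$

which is (6.1.1). Inequality (6.1.2) is a direct consequence of (6.5.1), along with the above
observation that

$$\|(I - \alpha S)^{-1}\|_\infty = \|(I - \alpha S)^{-1}e\|_\infty = \frac{1}{1-\alpha}. \quad ∎$$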
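

The two bounds of Theorem 6.1.2 can be probed with central finite differences. This is a
numerical sanity check, not part of the proof. It assumes a hypothetical three-page
stochastic matrix, a uniform personalization vector, our own power-method helper
`pagerank`, and a step size `h` chosen for convenience.

```python
def pagerank(S, alpha, tol=1e-13, max_iter=100000):
    """Power method for a stochastic S (no dangling rows) and uniform v."""
    n = len(S)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [alpha * sum(pi[i] * S[i][j] for i in range(n)) + (1.0 - alpha) / n
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


S = [[0, 1, 0],
     [0, 0, 1],
     [0.5, 0.5, 0]]          # hypothetical stochastic matrix

alpha, h = 0.8, 1e-5
plus = pagerank(S, alpha + h)
minus = pagerank(S, alpha - h)
# Central-difference estimate of d(pi_j)/d(alpha) for each page j.
deriv = [(p - m) / (2 * h) for p, m in zip(plus, minus)]

componentwise_bound = 1 / (1 - alpha)       # right-hand side of (6.1.1)
norm_bound = 2 / (1 - alpha)                # right-hand side of (6.1.2)
print([round(d, 4) for d in deriv])
print(max(abs(d) for d in deriv) <= componentwise_bound,
      sum(abs(d) for d in deriv) <= norm_bound)
```

The estimated derivatives also sum to (nearly) zero, reflecting the identity
$\bigl(d\pi^T(\alpha)/d\alpha\bigr)e = 0$ used at the start of the proof.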


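Theorem 6.1.3 If $\pi^T(\alpha)$ is the PageRank vector associated with the Google matrix
$G(\alpha) = \alpha S + (1-\alpha)ev^T$, then

$$\frac{d\pi^T(\alpha)}{d\alpha} = -v^T(I - S)(I - \alpha S)^{-2}. \qquad (6.1.3)$$

In particular, the limiting values of this derivative are

$$\lim_{\alpha \to 0} \frac{d\pi^T(\alpha)}{d\alpha} = -v^T(I - S) \quad \text{and} \quad \lim_{\alpha \to 1} \frac{d\pi^T(\alpha)}{d\alpha} = -v^T(I - S)^{\#},$$

where $(\star)^{\#}$ denotes the group inverse [46, 122].


Proof. Multiplying $0^T = \pi^T(\alpha)\bigl(I - \alpha S - (1-\alpha)ev^T\bigr)$ on the right by
$(I - \alpha S)^{-1}$ yields

$$0^T = \pi^T(\alpha)\bigl(I - (1-\alpha)ev^T(I - \alpha S)^{-1}\bigr) \implies \pi^T(\alpha) = (1-\alpha)v^T(I - \alpha S)^{-1}.$$

Using the formula $dA(\alpha)^{-1}/d\alpha = -A^{-1}(\alpha)[dA(\alpha)/d\alpha]A^{-1}(\alpha)$
for differentiating an inverse matrix [127, p. 130] together with the fact that $(I - S)$
commutes with $(I - \alpha S)^{-1}$ produces

$$\begin{aligned}
\frac{d\pi^T(\alpha)}{d\alpha}
&= (1-\alpha)v^T(I - \alpha S)^{-1}S(I - \alpha S)^{-1} - v^T(I - \alpha S)^{-1} \\
&= -v^T(I - \alpha S)^{-1}\bigl[I - (1-\alpha)S(I - \alpha S)^{-1}\bigr] \\
&= -v^T(I - \alpha S)^{-1}\bigl(I - \alpha S - (1-\alpha)S\bigr)(I - \alpha S)^{-1} \\
&= -v^T(I - \alpha S)^{-1}(I - S)(I - \alpha S)^{-1} \\
&= -v^T(I - S)(I - \alpha S)^{-2}.
\end{aligned}$$

By definition, matrices $Y$ and $Z$ are group inverses of each other if and only if
$YZY = Y$, $ZYZ = Z$, and $YZ = ZY$, so it’s clear that if

$$Y(\alpha) = (I - S)(I - \alpha S)^{-2} \quad \text{and} \quad Z(\alpha) = (I - S)^{\#}(I - \alpha S)^{2},$$

then

$$Z^{\#}(\alpha) = \begin{cases} Y(\alpha) & \text{for } \alpha < 1, \\ (I - S)^{\#} & \text{for } \alpha = 1. \end{cases}$$

Therefore, by continuity properties of group inversion [46, p. 232], it follows that

$$\lim_{\alpha \to 1} Y(\alpha) = \lim_{\alpha \to 1} \bigl[Z^{\#}(\alpha)\bigr] = \Bigl[\lim_{\alpha \to 1} Z(\alpha)\Bigr]^{\#} = (I - S)^{\#},$$

and thus

$$\lim_{\alpha \to 1} \frac{d\pi^T(\alpha)}{d\alpha} = -v^T(I - S)^{\#}. \quad ∎$$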
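Theorem 6.5.1. Suppose $G = \alpha S + (1-\alpha)ev^T$ is the Google matrix with PageRank
vector $\pi^T$ and $\tilde{G} = \alpha \tilde{S} + (1-\alpha)ev^T$ is the updated Google matrix
(of the same size) with corresponding PageRank vector $\tilde{\pi}^T$. Then

$$\|\pi^T - \tilde{\pi}^T\|_1 \le \frac{2\alpha}{1-\alpha} \sum_{i \in U} \pi_i,$$

where $U$ is the set of all pages that have been updated.


Proof. Let $F$ be the matrix representing the perturbation between the two stochastic
matrices $S$ and $\tilde{S}$. Thus, $F = S - \tilde{S}$. Then,

$$\begin{aligned}
\pi^T - \tilde{\pi}^T
&= \alpha\pi^T S - \alpha\tilde{\pi}^T \tilde{S} \\
&= \alpha\pi^T S - \alpha(\tilde{\pi}^T - \pi^T + \pi^T)\tilde{S} \\
&= \alpha\pi^T S - \alpha\pi^T \tilde{S} + \alpha(\pi^T - \tilde{\pi}^T)\tilde{S} \\
&= \alpha\pi^T F + \alpha(\pi^T - \tilde{\pi}^T)\tilde{S}.
\end{aligned}$$

Solving for $\pi^T - \tilde{\pi}^T$ gives

$$\pi^T - \tilde{\pi}^T = \alpha\pi^T F (I - \alpha\tilde{S})^{-1}.$$

Computing norms, we obtain

$$\|\pi^T - \tilde{\pi}^T\|_1 \le \alpha \|\pi^T F\|_1 \|(I - \alpha\tilde{S})^{-1}\|_\infty = \frac{\alpha}{1-\alpha} \|\pi^T F\|_1.$$

See [108] for theorems and proofs showing that $I - \alpha\tilde{S}$ is nonsingular and that
its inverse has row sums of $1/(1-\alpha)$. Now reorder $F$ (and $\pi^T$) so that the rows
corresponding to updated pages (nonzero rows) are at the top of the matrix. Then

$$\pi^T F = \begin{pmatrix} \pi_1^T & \pi_2^T \end{pmatrix} \begin{pmatrix} F_1 \\ 0 \end{pmatrix} = \pi_1^T F_1.$$

Therefore, $\|\pi^T F\|_1 = \|\pi_1^T F_1\|_1 \le \|\pi_1^T\|_1 \|F_1\|_\infty$. And
$\|F_1\|_\infty = \|S_1 - \tilde{S}_1\|_\infty \le \|S_1\|_\infty + \|\tilde{S}_1\|_\infty = 2$,
where $S_1$ and $\tilde{S}_1$ also correspond to the updated pages. Therefore,
$\|\pi^T F\|_1 \le 2 \sum_{i \in U} \pi_i$. Finally,

$$\|\pi^T - \tilde{\pi}^T\|_1 \le \frac{2\alpha}{1-\alpha} \sum_{i \in U} \pi_i. \quad ∎$$


Chapter Seven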


The PageRank Problem as a Linear System 


Abraham Lincoln, in his humorous, self-deprecating style, said “If I were two-faced, would 
I be wearing this one?” Honest Abe wasn’t, but the PageRank problem is two-faced. 
There’s the eigenvector face it was given by its parents, Brin and Page, at birth, and there’s 
the linear system face, which can be arrived at with a little cosmetic surgery in the form 
of algebraic manipulation. Because Brin and Page originally conceived of the PageRank 
problem as an eigenvector problem (find the dominant eigenvector for the Google matrix), 
the eigenvector face has received much more press and fanfare. However, the normalized
eigenvector problem $\pi^T(\alpha S + (1-\alpha)ev^T) = \pi^T$ can be rewritten, with some
algebra, as

$$\pi^T(I - \alpha S) = (1-\alpha)v^T. \qquad (7.0.1)$$

This linear system is always accompanied by the normalization equation $\pi^T e = 1$. The
question is: which face should PageRank be wearing, or does it even matter? By the end
of the chapter we will answer these questions about the two-faced PageRank.


7.1 PROPERTIES OF $(I - \alpha S)$


In Chapter 4 we learned a lot about PageRank by discussing the properties of the Google 
Markov matrix G in the eigenvector problem. Now it’s time to carefully examine the linear 
system formulation of equation (7.0.1). Below are some interesting properties of the co- 
efficient matrix in this equation. (The proofs of these statements are very straightforward. 
See the books by Berman and Plemmons [21], Golub and Van Loan [82] or Meyer [127].) 


Properties of $(I - \alpha S)$:


1. $(I - \alpha S)$ is an M-matrix.

2. $(I - \alpha S)$ is nonsingular.

3. The row sums of $(I - \alpha S)$ are $1 - \alpha$.

4. $\|I - \alpha S\|_\infty = 1 + \alpha$.

5. Since $(I - \alpha S)$ is an M-matrix, $(I - \alpha S)^{-1} \ge 0$.

6. The row sums of $(I - \alpha S)^{-1}$ are $(1-\alpha)^{-1}$. Therefore,
   $\|(I - \alpha S)^{-1}\|_\infty = (1-\alpha)^{-1}$.

7. Thus, the condition number $\kappa_\infty(I - \alpha S) = (1+\alpha)/(1-\alpha)$.
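

Several of these properties are easy to confirm numerically on a small case. The sketch
below assumes a hypothetical four-page hyperlink matrix with one dangling node and a uniform
personalization vector; `inverse` (Gauss-Jordan elimination) and `norm_inf` are our own
helpers. Property 4 relies on at least one page having no self-link, which holds here.

```python
def inverse(A):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]


def norm_inf(M):
    """Infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


alpha = 0.85
H = [[0, 0.5, 0.5, 0],
     [0, 0, 1, 0],
     [1 / 3, 1 / 3, 0, 1 / 3],
     [0, 0, 0, 0]]                       # hypothetical; page 3 is dangling
n = len(H)
v = [1.0 / n] * n
S = [row if any(row) else v for row in H]   # dangling row replaced by v^T
A = [[(1.0 if i == j else 0.0) - alpha * S[i][j] for j in range(n)]
     for i in range(n)]
A_inv = inverse(A)

row_sums = [sum(row) for row in A]              # property 3
inv_row_sums = [sum(row) for row in A_inv]      # property 6
cond = norm_inf(A) * norm_inf(A_inv)            # property 7
print(row_sums, inv_row_sums, cond)
```

For $\alpha = .85$ the condition number comes out as $1.85/.15 \approx 12.3$, and it grows
without bound as $\alpha \to 1$.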


These are nice properties for $(I - \alpha S)$. However, recall that $(I - \alpha S)$ can be
pretty dense whenever the number of dangling nodes is large, because these completely sparse
rows are replaced with completely dense rows. We like to operate, whenever possible, on
the very sparse $H$ matrix. And so we wonder if similar properties hold for $(I - \alpha H)$.


7.2 PROPERTIES OF $(I - \alpha H)$


Using the rank-one dangling node trick (i.e., $S = H + av^T$), we can once again write the
PageRank problem in terms of the very sparse hyperlink matrix $H$. The linear system of
equation (7.0.1) can be rewritten as

$$\pi^T(I - \alpha H - \alpha a v^T) = (1-\alpha)v^T.$$

If we let $\pi^T a = \gamma$, then the linear system becomes

$$\pi^T(I - \alpha H) = (1 - \alpha + \alpha\gamma)v^T.$$

The scalar $\gamma$ holds the aggregate PageRank for all the dangling nodes. Since the
normalization equation $\pi^T e = 1$ will be applied at the end, we can arbitrarily choose a
convenient value for $\gamma$, say $\gamma = 1$ [55, 80, 109, 138]. We arrive at the following
conclusion.


Theorem 7.2.1 (Linear System for Google problem). Solving the linear system

$$x^T(I - \alpha H) = v^T \qquad (7.2.1)$$

and letting $\pi^T = x^T / x^T e$ produces the PageRank vector.


In addition, $(I - \alpha H)$ has many of the same properties as $(I - \alpha S)$.

Properties of $(I - \alpha H)$:


1. $(I - \alpha H)$ is an M-matrix.

2. $(I - \alpha H)$ is nonsingular.

3. The row sums of $(I - \alpha H)$ are either $1 - \alpha$ for nondangling nodes or 1 for
   dangling nodes.

4. $\|I - \alpha H\|_\infty = 1 + \alpha$.

5. Since $(I - \alpha H)$ is an M-matrix, $(I - \alpha H)^{-1} \ge 0$.

6. The row sums of $(I - \alpha H)^{-1}$ are equal to 1 for the dangling nodes and less than
   or equal to $\frac{1}{1-\alpha}$ for the nondangling nodes.

7. The condition number $\kappa_\infty(I - \alpha H) \le \frac{1+\alpha}{1-\alpha}$.

8. The row of $(I - \alpha H)^{-1}$ corresponding to dangling node $i$ is $e_i^T$, where $e_i$
   is the $i$th column of the identity matrix.
Linear System for PageRank problem


The sparse linear system formulation of the PageRank problem is

$$x^T(I - \alpha H) = v^T \quad \text{with} \quad \pi^T = x^T / x^T e.$$


Like the eigenvector formulation, the PageRank problem has a very sparse linear
system formulation (with at least eight nice properties). Solve both and you get the same
vector, the PageRank vector. So what’s the point? There are several good reasons for
remembering that PageRank is two-faced. First, for a small problem, such as computing
a ranking for a company Intranet, a direct method applied to the linear system is
much faster than the power method. Try this out with Matlab. Compare the PageRank
power method code from page 51 with some of Matlab’s built-in linear system solvers,
e.g., pi=v/(eye(n)-alpha*H), followed by the normalization pi=pi/sum(pi). Second, in
Chapter 5 we warned that as $\alpha \to 1$, the power method takes an increasing amount of
time to converge. However, the solution time of the direct method is unaffected by the
parameter $\alpha$. So $\alpha$ can be increased to capture the true essence of the Web, giving
less weight to the artificial teleportation matrix. But don’t forget the sensitivity issues
of Chapter 6. Unfortunately, the PageRank vector is sensitive as $\alpha \to 1$ regardless of
the problem formulation [100]. Third, thinking about PageRank as a linear system opens new
research doors. Nearly all PageRank research has focused on solving the eigenvector problem.
Researchers have recently begun experimenting with new PageRank techniques such as
preconditioners, multigrid methods, and reorderings [55, 80, 109]. In fact, a group from
Yahoo! recently tried popular linear system iterative methods such as BiCGSTAB and GMRES on
several large web graphs [80]. The preliminary results for some of these methods look
promising; see Section 8.4.
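

The claim that both faces give the same vector can be tried out without Matlab. The sketch
below is a plain-Python stand-in, not the book’s code from page 51. It assumes a
hypothetical four-page graph with one dangling node and a uniform personalization vector;
`solve`, `pagerank_linear`, and `pagerank_power` are our own names.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def pagerank_linear(H, alpha, v):
    """Solve x^T (I - alpha*H) = v^T, then normalize: pi^T = x^T / (x^T e)."""
    n = len(H)
    # Column form of the row system: (I - alpha*H)^T x = v.
    At = [[(1.0 if i == j else 0.0) - alpha * H[i][j] for i in range(n)]
          for j in range(n)]
    x = solve(At, v)
    total = sum(x)
    return [xi / total for xi in x]


def pagerank_power(H, alpha, v, tol=1e-13, max_iter=100000):
    """Power method on G = alpha*(H + a v^T) + (1 - alpha) e v^T."""
    n = len(H)
    pi = v[:]
    for _ in range(max_iter):
        dangling = sum(pi[i] for i in range(n) if not any(H[i]))
        new = [alpha * sum(pi[i] * H[i][j] for i in range(n))
               + (alpha * dangling + 1.0 - alpha) * v[j] for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


H = [[0, 0.5, 0.5, 0],
     [0, 0, 1, 0],
     [1 / 3, 1 / 3, 0, 1 / 3],
     [0, 0, 0, 0]]                       # hypothetical; page 3 is dangling
v = [0.25] * 4
for alpha in (0.85, 0.99):
    direct = pagerank_linear(H, alpha, v)
    power = pagerank_power(H, alpha, v)
    print(alpha, max(abs(a - b) for a, b in zip(direct, power)))
```

The direct solve does the same amount of work for either value of $\alpha$, whereas the
power method needs many more iterations at $\alpha = .99$ than at $\alpha = .85$.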


Google Hacks 


The O’Reilly book, Google Hacks: 100 Industrial-Strength Tips and Tools [44], 
shows readers that there’s more to Google than most people know. Google pro- 
vides a customizable interface as well as an even more flexible programming in- 
terface (Google’s Web API; see the aside on page 97) that allows users to exercise 
their creativity with Google. If you know how to use it (and Google Hacks helps),
Google is, to name just a few, an entertainment, research, social, informational,
news, archival, spelling, calculating, shopping, and email tool all rolled into one
user-friendly package.


7.3 PROOF OF THE PAGERANK SPARSE LINEAR SYSTEM


Theorem 7.3.1. Solving the linear system

$$x^T(I - \alpha H) = v^T \qquad (7.3.1)$$

and letting $\pi^T = x^T / x^T e$ produces the PageRank vector.


Proof. $\pi^T$ is the PageRank vector if it satisfies $\pi^T G = \pi^T$ and $\pi^T e = 1$.
Clearly, $\pi^T e = 1$. Showing $\pi^T G = \pi^T$ is equivalent to showing
$\pi^T(I - G) = 0^T$, which is equivalent to showing $x^T(I - G) = 0^T$.

$$\begin{aligned}
x^T(I - G) &= x^T\bigl(I - \alpha H - \alpha a v^T - (1-\alpha)ev^T\bigr) \\
&= x^T(I - \alpha H) - x^T\bigl(\alpha a + (1-\alpha)e\bigr)v^T \\
&= v^T - v^T = 0^T.
\end{aligned}$$

The last line results from the fact that $x^T\bigl(\alpha a + (1-\alpha)e\bigr) = 1$ because

$$\begin{aligned}
1 &= v^T e \\
&= x^T(I - \alpha H)e \\
&= x^T e - \alpha x^T H e \\
&= x^T e - \alpha x^T(e - a) \\
&= (1-\alpha)x^T e + \alpha x^T a. \quad ∎
\end{aligned}$$

Chapter Eight 


Issues in Large-Scale Implementation of PageRank 


On two occasions, I have been asked [by members of Parliament], ‘Pray, Mr.
Babbage, if you put into the machine wrong figures, will the right answers
come out?’ I am not able to rightly apprehend the kind of confusion of ideas
that could provoke such a question.
-- Charles Babbage, designer of the Analytical Engine, a prototype of the first computer


That’s a funny quote, but of course, for us the question is: if you put the right
(in our case, arbitrary) figures into the PageRank machine, do you get the right answers
out? Simple enough to answer. Just check that, for any input $\pi^{(0)T}$, the output
$\pi^{(k)T}$ satisfies $\pi^{(k)T} G = \pi^{(k)T}$ up to some tolerance. However, when the
problem size grows dramatically, crazy things can happen and simple questions aren’t so
simple. It’s hard to even put numbers into the machine, it’s hard to make the machine start
running, and it’s hard to know whether you have the right answer.


We’ve all had firsthand experiences with problems of scale. Things don’t always 
scale up nicely. Strategies for babysitting two or three kids just don’t work when you’re 
counseling 15-20 campers. Translating teaching strategies for 35 students to 120 students 
doesn’t work either. (At least one of your authors found this out the hard way.) In this 
chapter, we’ll talk about important issues that arise when researchers scale the PageRank 
model up to web-sized proportions. For instance, how do you store G when it’s of order 
8.1 billion? How accurate should the PageRank solution be? And how should dangling 
nodes be handled? These are substantial issues at the scale of the World Wide Web. 


8.1 STORAGE ISSUES 


Every adequate search engine requires huge storage facilities for archiving information 
such as webpages and their locations; inverted indexes and image indexes; content score 
information; PageRank scores; and the hyperlink graph. The 1998 paper by Brin and Page 
[39] and more recent papers by Google engineers [19, 78] provide detailed discussions of 
the many storage schemes used by the Google search engine for all parts of its information 
retrieval system. The excellent survey paper by Arasu et al. [9] also provides a section on 
storage schemes needed by any web search engine. Since this book deals with mathemati- 
cal link analysis algorithms, we focus only on the storage of the mathematical components, 
the matrices and vectors, used in the PageRank part of the Google system. 


Computing the PageRank vector requires access to the items in Table 8.1. Here
nnz($H$) is the number of nonzeros in $H$, $|D|$ is the number of dangling nodes, and $n$ is
the number of pages in the web graph. When $v^T$, the personalization vector, is the uniform
vector ($v^T = e^T/n$), no storage is required for $v^T$.


Table 8.1 Storage requirements for the PageRank problem


Entity          Description                                       Storage
$H$             sparse hyperlink matrix                           nnz($H$) doubles
$a$             sparse binary dangling node vector                $|D|$ integers
$v^T$           dense personalization vector                      $n$ doubles
$\pi^{(k)T}$    dense current iterate of PageRank power method    $n$ doubles


Since there are roughly 10 outlinks per page on average, nnz(H) is about 10n, 
which means that of the entities in Table 8.1, the sparse hyperlink matrix H requires the 
most storage. Thus, we begin our discussion of storage for the PageRank problem with H. 
The size of this matrix makes its storage nontrivial, and at times, requires some creativity. 
The first thing to determine about H is whether or not it will fit in the main memory of the 
available computer system. 
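

To get a feel for the sizes involved, the entries of Table 8.1 can be turned into rough byte
counts. The sketch below is back-of-the-envelope arithmetic under stated assumptions:
$n = 8.1$ billion pages (the order of $G$ quoted at the start of this chapter), 10 outlinks
per page, and 8 bytes per double. It counts only the stored values of $H$ and ignores the
index arrays that any sparse format also needs.

```python
n = 8_100_000_000            # pages: the "order 8.1 billion" quoted in this chapter
outlinks_per_page = 10       # rough average quoted above
BYTES_PER_DOUBLE = 8         # assumption: IEEE 754 double precision

nnz_H = outlinks_per_page * n
storage_bytes = {
    "H (values only)": nnz_H * BYTES_PER_DOUBLE,
    "v^T": n * BYTES_PER_DOUBLE,
    "pi^(k)T": n * BYTES_PER_DOUBLE,
}
for name, num_bytes in storage_bytes.items():
    print(f"{name:>16}: {num_bytes / 1e9:,.1f} GB")
```

Under these assumptions the nonzeros of $H$ alone occupy about 648 GB, ten times the 64.8 GB
needed for each dense vector, which is why the discussion of storage starts with $H$.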


For small subsets of the Web, when H fits in main memory, computation of the 
PageRank vector can be implemented in the usual fashion (e.g., using code similar to the 
Matlab programs given on pages 42 and 51). However, when the H matrix does not fit in 
main memory, a little more ingenuity (and complexity) is required. When a large hyper- 
link matrix exceeds a machine’s memory, there are two options: compress the data needed 
so that the compressed representation fits in main memory, then creatively implement a 
modified version of PageRank on this compressed representation, or keep the data in its 
uncompressed form and develop I/O (input/output)-efficient implementations of the com- 
putations that must take place on the large, uncompressed data. 


Even for modest web graphs for which the hyperlink matrix H can be stored in
main memory (meaning compression of the data is not essential), minor storage techniques
should still be employed to reduce the work involved at each iteration. For example, for the
random surfer model only, the H matrix can be decomposed into the product of the inverse
of the diagonal matrix D holding outdegrees of the nodes and the adjacency matrix L of
0’s and 1’s. First, the simple decomposition $H = D^{-1}L$, where $[D^{-1}]_{ii} = 1/d_{ii}$ if i is a
nondangling node, 0 otherwise, saves storage. Rather than storing nnz(H) real numbers
in double precision, we can store n integers (for D) and nnz(H) integers (for the locations
of 1’s in L). Integers require less storage than doubles. Second, $H = D^{-1}L$ reduces the
work at each PageRank power iteration. Each power iteration is executed as

$$\pi^{(k+1)T} = \alpha\,\pi^{(k)T}H + (\alpha\,\pi^{(k)T}a + 1 - \alpha)\,v^T.$$

The most expensive part, the vector-matrix multiplication $\pi^{(k)T}H$, requires nnz(H)
multiplications and nnz(H) additions. Using the vector diag($D^{-1}$), $\pi^{(k)T}H$ can be
accomplished as $\pi^{(k)T}D^{-1}L = (\pi^{(k)T})\;.\!*\;(\mathrm{diag}(D^{-1}))\,L$, where .* represents componentwise
multiplication of the elements in the two vectors. The first part, $(\pi^{(k)T})\;.\!*\;(\mathrm{diag}(D^{-1}))$,
requires n multiplications. Since L is an adjacency matrix, $(\pi^{(k)T})\;.\!*\;(\mathrm{diag}(D^{-1}))\,L$ now
requires a total of n multiplications and nnz(H) additions. Thus, using the $H = D^{-1}L$
decomposition saves nnz(H) − n multiplications. Unfortunately, this decomposition is
limited to the random surfer model. For the intelligent surfer model, other compact storage
schemes [18], such as compressed row storage or compressed column storage, may be
used. Of course, each compressed format, while saving some storage, requires a bit more
overhead for matrix operations.
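A minimal Python sketch of this scale-then-add product may make the operation count concrete. It assumes the outlinks implied by the adjacency list of Table 8.2 (node 2 is dangling); the names `outlinks` and `multiply_by_H` are ours, chosen for illustration only.

```python
# Outlinks of the tiny 6-node web (1-based labels); node 2 has no outlinks.
outlinks = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}
n = len(outlinks)

def multiply_by_H(pi):
    """Return pi^T H computed as (pi .* diag(D^-1)) L."""
    # n scalings: one per node (0 for dangling nodes). In practice the
    # reciprocals 1/d_ii would be stored, making each scaling a multiplication.
    scaled = [pi[i - 1] / len(outlinks[i]) if outlinks[i] else 0.0
              for i in range(1, n + 1)]
    # nnz(H) additions: L holds only 1's, so every link is a plain addition.
    result = [0.0] * n
    for i, targets in outlinks.items():
        for j in targets:
            result[j - 1] += scaled[i - 1]
    return result

pi = [1.0 / n] * n
product = multiply_by_H(pi)
```

Because the dangling node's row of H is zero, the entries of `product` sum to 5/6 rather than 1 here; the missing mass is what the dangling node vector a restores in the full iteration.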


As mathematicians, we see things as matrices and vectors, and so like to think of
H, $\pi^{(k)T}$, $v^T$, and a as stored in matrix or compressed matrix form. However, computer
scientists see arrays, stacks, and lists, and therefore store our matrices as adjacency lists.


The web-sized implementations of the PageRank model store the H (or L) matrix
in an adjacency list of the columns of the matrix [139]. In order to compute the PageRank
vector, the PageRank power method requires vector-matrix multiplications of $\pi^{(k)T}H$ at
each iteration k. Therefore, quick access to the columns of the matrix H (or L) is essential
to algorithm speed. Column i contains the inlink information for page i, which, for the
PageRank system of ranking webpages, is more important than the outlink information
contained in the rows of H (or L). Table 8.2 is an adjacency list representation of the
columns of L for the tiny 6-node web in Figure 8.1.


Figure 8.1 Tiny 6-node web 


Table 8.2 Adjacency list for random surfer model of Figure 8.1


Node | Inlinks from
1    | 3
2    | 1, 3
3    | 1
4    | 5, 6
5    | 3, 4
6    | 4, 5
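To see why column access is the natural fit, here is a hedged Python sketch of the power method driven directly by the inlink list of Table 8.2. The function name, the tolerance, and the choice of a uniform personalization vector with α = .85 are assumptions made for this illustration; this is not Moler's code.

```python
# Inlink adjacency list of Table 8.2 (the columns of L); labels are 1-based.
inlinks = {1: [3], 2: [1, 3], 3: [1], 4: [5, 6], 5: [3, 4], 6: [4, 5]}
n = len(inlinks)

# Outdegrees are recovered by counting how often each node appears as a source.
outdegree = {i: 0 for i in inlinks}
for sources in inlinks.values():
    for j in sources:
        outdegree[j] += 1

def pagerank_power(alpha=0.85, tol=1e-10, max_iter=1000):
    v = [1.0 / n] * n                     # uniform personalization vector
    pi = v[:]
    for _ in range(max_iter):
        # Rank sitting on dangling nodes is redistributed according to v.
        dangling_mass = sum(pi[i - 1] for i in inlinks if outdegree[i] == 0)
        new = []
        for i in range(1, n + 1):
            # Entry i needs only column i of H, i.e., the inlinks of page i.
            inflow = sum(pi[j - 1] / outdegree[j] for j in inlinks[i])
            new.append(alpha * inflow
                       + (alpha * dangling_mass + 1 - alpha) * v[i - 1])
        residual = sum(abs(x - y) for x, y in zip(new, pi))
        pi = new
        if residual < tol:
            break
    return pi

pi = pagerank_power()
```

Each new entry is built from a single row of the adjacency list, so the list can be streamed once per iteration.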


Exercise 2.24 of Cleve Moler’s recent book Numerical Computing with Matlab [132]
gives one possible implementation of the PageRank power method applied to an adjacency
list, along with sample Matlab code (PageRankpow.m) that can be downloaded
from http://www.mathworks.com/moler/. When the adjacency list does not fit in
main memory, references [139, 141] suggest methods for compressing the data.


Because of their potential and promise, we briefly discuss two methods for com- 
pressing the information in an adjacency list, the gap technique [25] and the reference 
encoding technique [140, 141]. The gap method exploits the locality of hyperlinked pages. 
Locality refers to the fact that the source and destination pages for a hyperlink are often 
close to each other lexicographically. A page labeled 100 often has inlinks from pages 
nearby lexicographically such as pages 112, 113, 116, and 117 rather than pages 924 and 
4,931,010. Based on this locality principle, the information in an adjacency list for page 
100 is stored as below. 


Node | Inlinks from
100  | 112 0 2 0


The label for the first page inlinking to page 100, which is page 112, is stored. After 
that, only the gaps between subsequent inlinking pages are stored. Since these gaps are 
usually nice, small integers, they require less storage. 
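The gap idea can be sketched in a few lines of Python. We assume here that a "gap" counts the labels skipped between consecutive inlinking pages, which is the convention that turns the inlinks 112, 113, 116, 117 into 112 0 2 0; the function names are illustrative, and the input is assumed to be a nonempty sorted list.

```python
def gap_encode(inlinks):
    """Keep the first label, then the number of labels skipped between neighbours."""
    encoded = [inlinks[0]]
    for previous, current in zip(inlinks, inlinks[1:]):
        encoded.append(current - previous - 1)
    return encoded

def gap_decode(encoded):
    """Invert gap_encode by re-accumulating the gaps."""
    inlinks = [encoded[0]]
    for gap in encoded[1:]:
        inlinks.append(inlinks[-1] + gap + 1)
    return inlinks

encoded = gap_encode([112, 113, 116, 117])   # [112, 0, 2, 0]
decoded = gap_decode(encoded)                # [112, 113, 116, 117]
```

The savings come later, when the small gap values are written with a variable-length integer code; that step is not shown here.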


The other graph compression method, reference encoding, exploits the similarity between
webpages. If pages $P_i$ and $P_j$ have similar adjacency lists, it is possible to compress
the adjacency list of $P_j$ by representing it in terms of the adjacency list of $P_i$, in which
case $P_i$ is called a reference page for $P_j$. Pages within the same domain might often share
common outlinks, making the reference encoding technique attractive. Consider the example
in Figure 8.2, taken from [141].


Adjacency list of $P_i$:                          5 | 7 | 12 | 89 | 101 | 190 | 390
Adjacency list of $P_j$:                          5 | 6 | 12 | 50 | 101 | 190
Reference encoding of $P_j$ in terms of $P_i$:    1010110 | 6 | 50


Figure 8.2 Reference encoding example


The adjacency list for page $P_j$ looks a lot like the adjacency list for $P_i$. In fact, both
pages have outlinks to pages 5, 12, 101, and 190. In
order to take advantage of this repetition, we need to create two vectors: a sharing vector
of 1’s and 0’s and a dissimilarity vector of integers. The binary sharing vector has the
same size as the adjacency list of $P_i$, and contains a 1 in the $k$th position if entry k of $P_i$’s
adjacency list appears in $P_j$’s adjacency list. The second vector in the reference encoding
is a list of all entries in the adjacency list of $P_j$ that are not found in the adjacency list of its
reference $P_i$. Of course, the sharing vector for $P_j$ requires less storage than the adjacency
list for $P_j$. Therefore, the effectiveness of reference encoding depends on the number of
dissimilar pages. $P_i$ is a good reference page for $P_j$ if the overlap between the adjacency
lists for the two pages is high, which means the dissimilarity vector is short. However, it’s
not easy to determine a reference page for each page in the index, so some guidelines are
given in [140]. Both the gap method and the reference encoding method are used, along
with other compression techniques, to impressively compress the information in a standard
web graph. These techniques are freely available in the efficient graph compression tool
WebGraph, which is produced by Paolo Boldi and Sebastiano Vigna [33, 34].
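The two vectors of the reference encoding are easy to build. The following Python sketch reproduces the example of Figure 8.2; the function names are hypothetical, and a production tool such as WebGraph layers further compression on top of this idea.

```python
def reference_encode(reference, target):
    """Encode target's adjacency list in terms of reference's adjacency list."""
    target_set = set(target)
    reference_set = set(reference)
    # Sharing vector: one bit per entry of the reference list.
    sharing = [1 if page in target_set else 0 for page in reference]
    # Dissimilarity vector: entries of target that the reference lacks.
    dissimilar = [page for page in target if page not in reference_set]
    return sharing, dissimilar

def reference_decode(reference, sharing, dissimilar):
    """Rebuild the (sorted) adjacency list from its reference encoding."""
    shared = [page for page, bit in zip(reference, sharing) if bit]
    return sorted(shared + dissimilar)

p_i = [5, 7, 12, 89, 101, 190, 390]      # reference page
p_j = [5, 6, 12, 50, 101, 190]           # page to be compressed
sharing, dissimilar = reference_encode(p_i, p_j)
restored = reference_decode(p_i, sharing, dissimilar)
```

Here `sharing` is the bit pattern 1010110 and `dissimilar` is the list 6, 50, as in the figure.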


References [47, 86] take the other approach; rather than compressing the matrix 
information, they suggest I/O-efficient implementations of PageRank. In addition, because 
the PageRank vector itself is large and completely dense, containing over 4.3 billion pages, 
and must be consulted in order to process each user query, Haveliwala [87] has suggested a 
technique to compress the PageRank vector. This encoding of the PageRank vector hopes 
to keep the ranking information cached in main memory, thus speeding query processing. 


8.2 CONVERGENCE CRITERION 


The power method applied to G is the predominant method for finding the PageRank vec- 
tor. Being an iterative method, the power method continues until some termination crite- 
rion is met. In Chapter 4, we mentioned the traditional termination criterion for the power 
method: stop when the residual (as measured by the difference of successive iterates) is less 
than some predetermined tolerance (i.e., $\|\pi^{(k+1)T} - \pi^{(k)T}\|_1 < \tau$). However, PageRank
researcher Taher Haveliwala [86] has rightfully noted that the exact values of the PageRank 
vector are not as important as the correct ordering of the values in this vector. That is, iter- 
ate until the ordering of the approximate PageRank vector obtained by the power method 
converges. Considering the scope of the PageRank problem, saving just a handful of itera- 
tions is praiseworthy. Haveliwala’s experiments show that the savings could be even more 
substantial on some datasets. As few as 10 iterations produced a good approximate order- 
ing, competitive with the exact ordering produced by the traditional convergence measure. 
This raises several interesting issues: How do you measure the difference between two or- 
derings? How do you determine when an ordering has converged satisfactorily? Or better 
yet, is it possible to write a “power method” that operates on and stores only orderings, 
rather than PageRank values, at each iteration? Several papers [65, 68, 69, 86, 88, 120] 
have provided a variety of answers to the question of comparing rank orderings, using 
such measures as Kendall’s Tau, rank aggregation, and set overlap. 
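As one illustration of comparing orderings, the sketch below computes Kendall's Tau between the rankings induced by two successive iterates. The score values are made up for the example, ties are assumed absent, and the quadratic pair loop is only suitable for small vectors.

```python
from itertools import combinations

def ranking(scores):
    """Page indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def kendall_tau(scores_a, scores_b):
    """Kendall's Tau for two tie-free score vectors: +1 same order, -1 reversed."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1       # the pair is ordered the same way in both
        elif sign < 0:
            discordant += 1       # the pair is swapped
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs

previous_iterate = [0.30, 0.25, 0.20, 0.15, 0.10]   # made-up values
current_iterate = [0.32, 0.21, 0.22, 0.15, 0.10]    # pages 1 and 2 swap places
tau = kendall_tau(previous_iterate, current_iterate)
```

An ordering-based stopping rule could then iterate until `tau` between successive iterates stays at 1 (or above some threshold) for a few iterations; what threshold is adequate is exactly the open question raised above.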


8.3 ACCURACY 


Another implementation issue is the accuracy of PageRank computations. We do not know 
the accuracy with which Google works, but it at least has to be high enough to differentiate 
between the often large list of ranked pages that Google commonly returns. Since $\pi^T$ is
a probability vector, each $\pi_i$ will be between 0 and 1. Suppose $\pi^T$ is a 1 by 4 billion
vector. Since the PageRank vector is known to follow a power law or Zipfian distribution
[16, 70, 136], it is possible that a small section of the tail of this vector, ranked in decreasing
order, might look like:

$$\pi^T = (\cdots \;\; .000001532 \;\; .0000015316 \;\; .0000015312 \;\; .0000015210 \;\; \cdots).$$

Accuracy at least on the order of $10^{-9}$ is needed to distinguish among the elements of this
ranked subvector. However, comparisons are made only among a subset of elements of
this ranked vector. While the elements of the entire global PageRank vector may be tightly
packed in some sections of the (0,1) interval, elements of the subset related to a particular
query are much less densely packed. Therefore, extreme accuracy on the order of $10^{-12}$ is
most likely unnecessary for this application.


The fact that Brin and Page report reasonable estimates for $\pi^T$ after only 50 iterations
of the power method on a matrix of order 322,000,000 has one of two implications:
either (1) their estimates of $\pi^T$ are not very accurate or (2) the subdominant eigenvalue of
the iteration matrix is far removed from $\lambda_1 = 1$. The first statement is a claim that outsiders
not privy to inside information can never verify, as Google has never published information
about their convergence tests. The implication of the second statement is that the “fudge
factor” matrix $E = ev^T$ must carry a good deal of weight and perhaps $\alpha$ is lowered to .8
in order to increase the eigengap and speed convergence. By decreasing $\alpha$ and simultaneously
increasing the weight of the fudge factor, the transition probability matrix moves
farther from the Web’s original hyperlink structure.
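A back-of-the-envelope calculation shows how strongly α controls the iteration count. The sketch below assumes the standard bound that the subdominant eigenvalue of G is at most α in magnitude, so that the error shrinks roughly like $\alpha^k$; the function name and the tolerance $10^{-8}$ are illustrative choices.

```python
import math

def iterations_needed(alpha, tol):
    """Smallest k with alpha**k <= tol, a rough estimate of the iteration count."""
    return math.ceil(math.log(tol) / math.log(alpha))

for alpha in (0.8, 0.85, 0.9):
    k = iterations_needed(alpha, 1e-8)
    error_after_50 = alpha ** 50      # rough error level after 50 iterations
    print(alpha, k, error_after_50)
```

Under this assumption, 50 iterations with α = .85 leave an error on the order of $10^{-4}$, while lowering α to .8 buys more than an extra digit in the same number of iterations.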


8.4 DANGLING NODES 


When you begin large-scale implementation of PageRank, you must make a design de- 
cision about how you're going to deal with dangling nodes, and this decision will affect 
the PageRanks you produce. Every webpage is either a dangling node or a nondangling 
node. We first encountered dangling nodes in Chapter 4: the pages with no outlinks that
caused the problem of rank sinks. All other pages, having at least one outlink, are called 
nondangling nodes. Dangling nodes exist in many forms. For example, a page of data, a 
page with a postscript graph, a page with jpeg pictures, a pdf document, a page that has 
been fetched by a crawler but not yet explored—these are all examples of possible dangling 
nodes. The more ambitious the crawl, the bigger the proportion of dangling nodes because 
the set of fetched but uncrawled pages grows quickly. In fact, for some subsets of the Web, 
dangling nodes make up 80% of the collection’s pages. 


The presence of these dangling nodes causes both philosophical and computational
issues for the PageRank problem. To understand this, let’s recap how the PageRank model
addresses dangling nodes. Google founders Brin and Page suggested replacing $0^T$ rows of
the sparse hyperlink matrix H with dense vectors (the uniform vector $e^T/n$ or the more
general $v^T$ vector) to create the stochastic matrix S. Of course, if this suggestion were to
be implemented explicitly, storage requirements would increase dramatically. Instead, we
showed in Chapter 4 how the stochasticity fix can be modeled implicitly with the construction
of one vector, the dangling node vector a. Element $a_i = 1$ if row i of H corresponds
to a dangling node, and 0 otherwise. Then S (and also G) can be written as a rank-one
update of H.

$$S = H + av^T, \quad \text{and therefore,} \quad G = \alpha S + (1-\alpha)ev^T = \alpha H + (\alpha a + (1-\alpha)e)v^T.$$

The PageRank power method

$$\pi^{(k+1)T} = \alpha\,\pi^{(k)T}H + (\alpha\,\pi^{(k)T}a + 1 - \alpha)\,v^T \qquad (8.4.1)$$

is then applied to compute the PageRank vector. (The Matlab code for the PageRank power
method is given in the box on page 51.) However, this is not exactly the way that Brin
and Page originally dealt with dangling nodes [38, 40]. Instead, they suggest “removing
dangling nodes during the computation of PageRank, then adding them back in after the
PageRanks have converged” [38], presumably for the final few iterations [102].
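The rank-one structure is easy to check numerically. The Python sketch below builds the dense G for a toy 6-node web (an assumed example in which node 2 is dangling) and confirms that one step with G matches one step of equation (8.4.1), which touches only H, a, and v. The dense matrices exist only for this check; the whole point of the implicit form is never to build them.

```python
# Outlinks of a tiny 6-node web in which node 2 is dangling (1-based labels).
outlinks = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}
n, alpha = len(outlinks), 0.85
v = [1.0 / n] * n                                              # personalization vector
a = [0.0 if outlinks[i + 1] else 1.0 for i in range(n)]        # dangling node vector

# Dense H, built only to verify the identity on this toy example.
H = [[0.0] * n for _ in range(n)]
for i, targets in outlinks.items():
    for j in targets:
        H[i - 1][j - 1] = 1.0 / len(targets)

# Explicit Google matrix G = alpha*(H + a v^T) + (1 - alpha) e v^T.
G = [[alpha * (H[i][j] + a[i] * v[j]) + (1 - alpha) * v[j] for j in range(n)]
     for i in range(n)]

def step_explicit(pi):
    return [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]

def step_implicit(pi):
    # Equation (8.4.1); valid when the entries of pi sum to 1.
    scalar = alpha * sum(p * d for p, d in zip(pi, a)) + 1 - alpha
    return [alpha * sum(pi[i] * H[i][j] for i in range(n)) + scalar * v[j]
            for j in range(n)]

pi = [0.4, 0.1, 0.2, 0.1, 0.1, 0.1]      # any probability vector will do
difference = max(abs(x - y)
                 for x, y in zip(step_explicit(pi), step_implicit(pi)))
```

The two steps agree to rounding error, while the implicit version needs only the nnz(H) stored entries plus two vectors.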


This brings us to the philosophical issue of dangling nodes. To dangle or not to dangle:
that is the question. And it’s not an easy one to answer. We warn you: answer carefully
or face discrimination charges. Leaving dangling nodes out somehow feels morally
wrong. Arguing in a utilitarian vein that dangling nodes can’t be that important anyway
certainly is scientifically wrong. A dangling node with lots of inlinks from important pages
has just as much right to a high PageRank as a nondangling node, and shouldn’t be tossed
aside (the way Brin and Page suggested) as a matter of algorithmic convenience. Indeed,
this was confirmed experimentally by Kevin McCurley, one of the first scientists to boldly
explore the Web Frontier (Kevin’s name for the set of dangling nodes, since many dangling
nodes are yet-to-be-crawled pages). He showed on small graphs as well as enormous
graphs that some dangling nodes can have higher rank than nondangling nodes [66].
Removing the dangling nodes completely can cause even more problems. The process of
removing these nodes can itself produce new dangling nodes. If this process is repeated
until no dangling nodes remain, it’s possible in theory (although unlikely) that no nodes
remain. Further, removing the dangling nodes amounts to unnecessarily removing a great
deal of useful data.


Excluding the dangling nodes from the start, then trying to make it up to them later 
(Brin and Page’s solution) also feels wrong. In fact, the dangling node gets a treatment 
similar to the Native American. Further, the exclusion/correction procedure biases all the 
PageRank values of nondangling and dangling nodes alike, and unnecessarily so. 


A better solution is to treat all nodes fairly from the start. Include the dangling nodes, 
but be aware of their unique talents. That’s exactly the solution proposed by three groups 
of researchers, Lee et al. [112], McCurley et al. [66], and yours truly (authors Carl and 
Amy) [109]. We note that our first solution to the PageRank problem (represented by the 
power method of equation 8.4.1 and the linear system of equation 7.2.1) treats all nodes 
fairly from the start, but doesn’t capitalize on the unique potential of the dangling nodes. 
We describe this potential in the next few paragraphs. 


Stanford graduate student Chris Lee and his colleagues noticed that, for the most
part, all dangling nodes look alike; at least their rows in H (and S and G) do [112].
And further, whenever the random surfer arrives at a dangling node, he always behaves
the same. Regardless of the particular dangling node he’s currently at, he always teleports
immediately to a new page (at random if $v^T = e^T/n$ or according to the given teleportation
distribution if $v^T \neq e^T/n$). If that’s the case, Lee et al. thought, why not lump the
individual dangling nodes together into one new state, a teleportation state? This reduces
the size of the problem greatly, especially if the proportion of dangling nodes is high.
However, solving the smaller $(|ND| + 1) \times (|ND| + 1)$ system, where $|ND|$ is the number
of nondangling nodes, creates two new problems. First, ranking scores are available only
for the nondangling pages plus the one lumped teleportation state. Second, this smaller set
of rankings is biased. The question is: how can we recover the scores for each dangling
node and remove the bias in the ranks? While Lee et al.’s answer to this question can
be explained by the mathematical techniques of aggregation [51, 56, 92, 151, 154, 155]
and stochastic complementation [125], we present an alternative answer, which is easier to
follow and was inspired by the linear system formulation of McCurley et al. [66].


Suppose the rows and columns of H are permuted (i.e., the indices are reordered) so
that the rows corresponding to dangling nodes are at the bottom of the matrix,

$$H = \begin{pmatrix} H_{11} & H_{12} \\ 0 & 0 \end{pmatrix},$$

with rows and columns ordered as (ND, D),
where ND is the set of nondangling nodes and D is the set of dangling nodes. The coefficient
matrix in the sparse linear system formulation of Chapter 7 (i.e., $x^T(I - \alpha H) = v^T$ with
$\pi^T = x^T/x^Te$) becomes

$$(I - \alpha H) = \begin{pmatrix} I - \alpha H_{11} & -\alpha H_{12} \\ 0 & I \end{pmatrix},$$

and the inverse of this matrix is

$$(I - \alpha H)^{-1} = \begin{pmatrix} (I - \alpha H_{11})^{-1} & \alpha (I - \alpha H_{11})^{-1} H_{12} \\ 0 & I \end{pmatrix}.$$

Therefore, the unnormalized PageRank vector $x^T = v^T(I - \alpha H)^{-1}$ can be written as

$$x^T = \left( v_1^T (I - \alpha H_{11})^{-1} \;\; \big| \;\; \alpha v_1^T (I - \alpha H_{11})^{-1} H_{12} + v_2^T \right),$$

where the personalization vector $v^T$ has been partitioned accordingly into nondangling
($v_1^T$) and dangling ($v_2^T$) sections. Note that $I - \alpha H_{11}$ inherits many of the properties of
$I - \alpha H$ from Chapter 7, most especially nonsingularity. In summary, we now have an
algorithm that computes the PageRank vector using only the nondangling portion of the
web, exploiting the rank-one structure (and therefore lumpability) of the dangling node
fix.


DANGLING NODE PAGERANK ALGORITHM


1. Solve for $x_1^T$ in $x_1^T(I - \alpha H_{11}) = v_1^T$.


2. Compute $x_2^T = \alpha x_1^T H_{12} + v_2^T$.


3. Normalize $\pi^T = [x_1^T \;\; x_2^T] / \|[x_1^T \;\; x_2^T]\|_1$.


This algorithm is much simpler and cleaner, but equivalent to the specialized iterative 
method proposed by Lee et al. [112], which exploits the dangling nodes to reduce compu- 
tation of the PageRank vector by a factor of 1/5 on a graph in which 80% of the nodes are 
dangling. While this solution to the problem of dangling nodes gives them fair treatment 
and capitalizes on their unique properties, we can do even better. 
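The three steps of the dangling node PageRank algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the 6-node web (node 2 dangling), the uniform personalization vector, α = .85, and the simple fixed-point iteration used for step 1 are all our choices, not a prescription.

```python
outlinks = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}
alpha = 0.85
n = len(outlinks)
v = {i: 1.0 / n for i in outlinks}                 # uniform personalization vector
nondangling = [i for i in outlinks if outlinks[i]]
dangling = [i for i in outlinks if not outlinks[i]]

def inflow(x, columns):
    """Return alpha * x^T H restricted to the given columns."""
    total = {j: 0.0 for j in columns}
    for i, value in x.items():
        for j in outlinks[i]:
            if j in total:
                total[j] += alpha * value / len(outlinks[i])
    return total

# Step 1: solve x1^T (I - alpha*H11) = v1^T with the iteration
# x1 <- v1 + alpha * x1 * H11, which involves nondangling nodes only.
x1 = {i: v[i] for i in nondangling}
for _ in range(1000):
    flow = inflow(x1, nondangling)
    new_x1 = {i: v[i] + flow[i] for i in nondangling}
    change = sum(abs(new_x1[i] - x1[i]) for i in nondangling)
    x1 = new_x1
    if change < 1e-14:
        break

# Step 2: x2^T = alpha * x1^T H12 + v2^T, a single sparse multiplication.
flow = inflow(x1, dangling)
x2 = {i: v[i] + flow[i] for i in dangling}

# Step 3: normalize so that the entries sum to 1.
total = sum(x1.values()) + sum(x2.values())
pagerank = {i: value / total for i, value in {**x1, **x2}.items()}
```

The resulting vector satisfies the power method equation (8.4.1), yet the iteration in step 1 never touches a dangling node.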


Inspired by the dangling node PageRank algorithm above, we wondered if a deeper
search for “sub-dangling” nodes might help further. That is, if the presence of dangling
nodes, and therefore $0^T$ rows in H, is so advantageous, can we find more $0^T$ rows in
submatrices of H? In fact, in [109], we proposed that the process of locating zero rows be
repeated recursively on smaller and smaller submatrices of H, continuing until a submatrix
is created that has no zero rows. For example, consider executing such a process on a
hyperlink matrix H that has 9664 rows and columns and contains 16773 nonzero entries in
the positions indicated in the left-hand side of Figure 8.3. The process amounts to a simple
reordering of the states of the Markov chain. The left pane shows the nonzero pattern in
H, and the right pane is the nonzero pattern after the rows of H are reordered according to
the recursive dangling node idea.


Figure 8.3 Original and reordered H for sample web hyperlink matrix


In general, after this symmetric reordering, the coefficient matrix of the linear system
formulation of the PageRank problem of equation (7.2.1) has the following structure.

$$(I - \alpha H) = \begin{pmatrix}
I - \alpha H_{11} & -\alpha H_{12} & -\alpha H_{13} & \cdots & -\alpha H_{1b} \\
 & I & -\alpha H_{23} & \cdots & -\alpha H_{2b} \\
 & & I & \cdots & -\alpha H_{3b} \\
 & & & \ddots & \vdots \\
 & & & & I
\end{pmatrix},$$

where b is the number of square diagonal blocks in the reordered matrix. Thus, the reordered
system can be solved by forward substitution. The only system that must be solved
directly is the first subsystem, $x_1^T(I - \alpha H_{11}) = v_1^T$, where $\pi^T$ and $v^T$ have also been
partitioned accordingly. The remaining subvectors of $x^T$ are computed quickly and efficiently
by forward substitution.


DANGLING NODE PAGERANK ALGORITHM 2


1. Reorder the states of the original Markov chain, so that the reordered matrix has the
structure given above.


2. Solve for $x_1^T$ in $x_1^T(I - \alpha H_{11}) = v_1^T$.


3. For i = 2 to b, compute $x_i^T = \alpha \sum_{j=1}^{i-1} x_j^T H_{ji} + v_i^T$.


4. Normalize $\pi^T = [x_1^T \;\; x_2^T \;\; \cdots \;\; x_b^T] / \|[x_1^T \;\; x_2^T \;\; \cdots \;\; x_b^T]\|_1$.


In the example from Figure 8.3, a 2,622 $\times$ 2,622 system can be solved instead
of the full 9,664 $\times$ 9,664 system. The small subsystem $x_1^T(I - \alpha H_{11}) = v_1^T$ can be
solved by a direct method (if small enough) or an iterative method (such as the Jacobi
method). Reference [109] provides further details of the reordering method along with
experimental results, suggested methods for solving the $x_1^T(I - \alpha H_{11}) = v_1^T$ system,
and convergence properties. Fortunately, it turns out that this dangling node PageRank
algorithm has the same asymptotic rate of convergence as the original PageRank algorithm
of equation (5.3.1), which means that because it operates on a much smaller problem it can
take much less time than the standard PageRank power method, provided the reordering
can be efficiently implemented.
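A small Python sketch may help make the recursive reordering and the forward substitution concrete. The 5-node graph, the function names, the uniform personalization vector, and the fixed-point iteration for the first block are assumptions chosen for illustration; reference [109] should be consulted for the actual method and its implementation details.

```python
def dangling_blocks(outlinks):
    """Peel off zero rows repeatedly; return the blocks in reordered order."""
    remaining = set(outlinks)
    peeled = []
    while remaining:
        # Rows that are zero within the current submatrix.
        zero_rows = sorted(i for i in remaining
                           if not any(j in remaining for j in outlinks[i]))
        if not zero_rows:
            break
        peeled.append(zero_rows)
        remaining -= set(zero_rows)
    core = [sorted(remaining)] if remaining else []
    return core + peeled[::-1]

def reordered_pagerank(outlinks, alpha=0.85):
    n = len(outlinks)
    v = {i: 1.0 / n for i in outlinks}
    blocks = dangling_blocks(outlinks)
    # First block: iterate x1 <- v1 + alpha * x1 * H11.
    core = set(blocks[0])
    x = {i: v[i] for i in core}
    for _ in range(1000):
        new = {i: v[i] for i in core}
        for i in core:
            for j in outlinks[i]:
                if j in core:
                    new[j] += alpha * x[i] / len(outlinks[i])
        change = sum(abs(new[i] - x[i]) for i in core)
        x = new
        if change < 1e-14:
            break
    # Later blocks: forward substitution, x_i = alpha * sum_{j<i} x_j H_ji + v_i.
    for block in blocks[1:]:
        new = {i: v[i] for i in block}
        for i, value in x.items():          # x holds the earlier blocks only
            for j in outlinks[i]:
                if j in new:
                    new[j] += alpha * value / len(outlinks[i])
        x.update(new)
    total = sum(x.values())
    return blocks, {i: value / total for i, value in x.items()}

# Hypothetical web: 5 dangles, 4 links only to 5, 3 links only to 4 and 5.
web = {1: [2, 3], 2: [1, 4], 3: [4, 5], 4: [5], 5: []}
blocks, pagerank = reordered_pagerank(web)
```

For this graph the reordering finds the blocks {1, 2}, {3}, {4}, {5}, so only a 2 × 2 system is solved iteratively and the other three scores follow from one pass of forward substitution.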


8.5 BACK BUTTON MODELING 


Related to the topic of dangling nodes is the issue of the back button. Oftentimes during
a PageRank talk at a scientific conference, right after we’ve introduced dangling nodes 
and their problems and solutions, we are asked, “But what about the browser’s back but- 
ton? How does PageRank account for this button?” The short answer is that, as originally 
conceived, the PageRank model does not allow for a back button. Our questioner usually 
doesn’t give in so easily, “Whenever I’m surfing and I enter a dangling node, I simply back 
my way out until I can proceed with forward links again.” We concede: that’s exactly
what most surfers do. However, accounting for the back button complicates the mathemat- 
ics of the PageRank model. In fact, the defining property of a Markov chain is that it’s 
memoryless. That is, upon transitioning and arriving at a new webpage, the chain does not 
remember from whence it came. Therefore, one way to model the back button would be to 
add memory to the Markov chain. Unfortunately, this quickly obscures the elegant mathe- 
matical and computational beauty of the Markov chain. Nevertheless, several researchers 
have proceeded in this direction [67, 119, 157], hoping that the increase in complexity is 
offset by the back button’s ability to more accurately capture true Web surfing behavior. 


There are many ways to model the back button on a Web browser. We propose one
very simplistic approach that incorporates limited back button usage into the PageRank
model yet still stays in the Markov framework. In this model, once the random surfer
arrives at a dangling node, he immediately returns to the page he came from. It’s important
to note that this bounce-back feature simulates the back button only for dangling nodes.
Unfortunately, in order to achieve this bounce back, we need to add a new node for every
inlink into each dangling node. However, the resulting, larger hyperlink matrix, which
we call $\bar{H}$, has some nice structure. To understand the bounce-back model, consider an
example based on Figure 8.4. The hyperlink matrix H associated with Figure 8.4 is


Figure 8.4 Original 6-node graph for back button model


The modified graph with bounce-back capability appears in Figure 8.5. The modifications
are shown with dashes. Thus, the bounce-back hyperlink matrix $\bar{H}$ is


Figure 8.5 Bounce-back 6-node graph for back button model


$\bar{H}$ is now stochastic, so no artificial stochastic fix is needed. However, eventually an
irreducibility fix must still be applied. Execute the following steps to create $\bar{H}$, the stochastic
hyperlink matrix for the bounce-back model. (Note that $\bar{H}$ could be called S.)


• Reorder H so that

$$H = \begin{pmatrix} H_{11} & H_{12} \\ 0 & 0 \end{pmatrix},$$

with rows and columns ordered as (ND, D). See section 8.4.

• For each inlink into a dangling node, create a bounce-back node. There will be
nnz($H_{12}$) of these bounce-back nodes instead of the |D| nodes in the dangling node
set. If each dangling node has more than one inlink and there are many dangling
nodes, this could drastically increase the size of the matrix. The bounce-back hyperlink
matrix has the following block form, with rows and columns ordered as (ND, BB).

$$\bar{H} = \begin{pmatrix} \bar{H}_{11} & \bar{H}_{12} \\ \bar{H}_{21} & 0 \end{pmatrix}.$$

• Form the three nonzero blocks of $\bar{H}$. First, $\bar{H}_{11} = H_{11}$. Second, there is structural
symmetry between $\bar{H}_{12}$ and $\bar{H}_{21}$ that can be exploited. That is, if element (i, j) of
$\bar{H}_{12}$ is nonzero, then element (j, i) of $\bar{H}_{21}$ = 1. Further, while the size of $\bar{H}$ can be
much larger than the size of H, $\bar{H}$ only has nnz($H_{12}$) more nonzeros than H and
all of these are the integer 1. As a result of this nice structure, the Matlab commands
find and sparse can be used to create the $\bar{H}_{12}$ and $\bar{H}_{21}$ blocks.

% one column per inlink into a dangling node; sizes are given explicitly so
% that trailing all-zero rows of H12 are not dropped
[r,c,v] = find(H12);
Hbar12 = sparse(r, 1:nnz(H12), v, size(H12,1), nnz(H12));
Hbar21 = (Hbar12 > 0)';


To compute the bounce-back PageRank vector, simply run any PageRank algorithm
such as the original algorithm of equation 4.6.1 on page 40 or the accelerated versions of
Chapter 9 on $\bar{G} = \alpha\bar{H} + (1-\alpha)ev^T$, where $\bar{H}$ is the bounce-back hyperlink matrix.
Of course, the algorithms are slightly modified due
to the fact that $\bar{H}$ is now also stochastic. Thus, the bounce-back PageRank power method
is

$$\bar{\pi}^{(k+1)T} = \bar{\pi}^{(k)T}\bar{G} = \alpha\,\bar{\pi}^{(k)T}\bar{H} + (1-\alpha)\,v^T.$$


The bounce-back PageRank vector for $\bar{H}$ is longer than the standard PageRank vector for
H. To compare the two vectors, simply collapse multiple bounce-back nodes for each
dangling node back into one node. For the above example, with $\alpha = .85$ and $v^T = e^T/n$,


Node                         1       2       3       4       5       6
$\pi^T(H)$             = (0.1726  0.1726  0.2102  0.1726  0.0993  0.1726) and

Node                         1       2       3       4      $5_3$   $6_3$   $6_4$
$\bar{\pi}^T(\bar{H})$ = (0.1214  0.1214  0.2846  0.2186  0.0698  0.0698  0.1143).


The collapsed vector is (0.1214  0.1214  0.2846  0.2186  0.0698  0.1841). The
ranking of pages (from most to least important) associated with $\pi^T$ is (3  1/2/4/6  5),
while the ranking associated with the collapsed vector is (3  4  6  1/2  5), where the / symbol
indicates a tie. Of course, on such a small example the difference in the two rankings is
apparent. Much larger experiments are needed to determine the value of bounce-back PageRank
as an alternative ranking.
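The bounce-back construction can be sketched in Python. The outlinks below are an assumption: Figure 8.4 is not reproduced here, so the graph was chosen to be consistent with the node labels and scores quoted above (nodes 5 and 6 dangle, with inlinks 3→5, 3→6, and 4→6). The function name and the tuple labels for bounce-back nodes are likewise illustrative.

```python
# Hypothetical outlinks standing in for Figure 8.4; nodes 5 and 6 are dangling.
outlinks = {1: [2, 3], 2: [1, 4], 3: [1, 2, 4, 5, 6], 4: [3, 6], 5: [], 6: []}
alpha = 0.85
dangling = {i for i, targets in outlinks.items() if not targets}

# Bounce-back graph: every inlink (i -> d) into a dangling node d gets its own
# node (d, i), whose single outlink returns the surfer to page i.
bounce = {}
for i, targets in outlinks.items():
    if i in dangling:
        continue
    bounce[i] = [(j, i) if j in dangling else j for j in targets]
for i, targets in outlinks.items():
    for j in targets:
        if j in dangling:
            bounce[(j, i)] = [i]

def power_method(graph, alpha, tol=1e-12, max_iter=1000):
    """Power method for a graph without dangling nodes, uniform teleportation."""
    nodes = list(graph)
    pi = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(max_iter):
        new = {u: (1 - alpha) / len(nodes) for u in nodes}
        for u, targets in graph.items():
            for w in targets:
                new[w] += alpha * pi[u] / len(targets)
        change = sum(abs(new[u] - pi[u]) for u in nodes)
        pi = new
        if change < tol:
            break
    return pi

bounce_pi = power_method(bounce, alpha)

# Collapse the bounce-back nodes of each dangling node into one score.
collapsed = {i: 0.0 for i in outlinks}
for u, score in bounce_pi.items():
    collapsed[u[0] if isinstance(u, tuple) else u] += score
```

With these assumed links the bounce-back graph has 7 nodes, and the collapsed scores should agree with the collapsed vector quoted above to about three decimal places.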


ASIDE: Google’s Initial Public Offering 


Speculation and rumors about Google’s initial public offering (IPO) of stock shares 
began in 2003. On August 1, 2004, Google issued a press release about their IPO. True to 
their founding principles, Google’s IPO was original. Google used a Dutch auction to take 
bids from investors. For the auction, investors submitted a bid with the price and number of
shares they were willing to buy. Then Google and its underwriting bankers, Morgan Stanley
and Credit Suisse Group First Boston, determined the clearing price, which is the highest price
at which there is a demand for all of the 24.6 million shares. The offer price was then set at 
or below the clearing price. Google believed the Dutch auction was the best way to level the 
playing field and allow small individual investors and large corporate investors equal access to 
shares. Google expected the offer price to fall somewhere between $108 and $135. On July 31, 
2004, the IPO information website, www.ipo.google.com, opened. This site contained a
100-plus page prospectus that informed prospective investors about the risk factors, auction 
process, company history and mission, search trends, and financial data. It also contained a 
Meet the Management presentation in which the company’s leaders, founders Sergey Brin and 
Larry Page, CEO Eric Schmidt, and CFO George Reyes, summarize some of the main issues 
in the detailed prospectus. Google shares ended up selling on August 19, 2004 for $85 each, 
bringing in over $1.1 billion for the company and making it the biggest technology IPO in 
history and the 25th largest IPO overall. You can track the price of Google shares by watching 
the Nasdaq ticker symbol GOOG. 
Chapter Nine


Accelerating the Computation of PageRank 


People have a natural fascination with speed. Look around; articles abound on Nascar and 
the world’s fastest couple—Marion Jones and Tim Montgomery—speedboat racing and 
speed dating, fast food and the Concorde jet. So the interest in speeding up the computation 
of PageRank seems natural, but actually it’s essential because the PageRank computation 
by the standard power method takes days to converge. And the Web is growing rapidly, so 
days could turn into weeks if new methods aren’t discovered. 


Because the classical power method is known for its slow convergence, researchers 
immediately looked to other solution methods. However, the size and sparsity of the web 
matrix create limitations on the solution methods and have caused the predominance of the 
power method. This restriction to the power method has forced new research on the often 
criticized power method and has resulted in numerous improvements to the vanilla-flavored 
power method that are tailored to the PageRank problem. Since 1998, the resurgence in 
work on the power method has brought exciting, innovative twists to the old, unadorned 
workhorse. As each iteration of the power method on a web-sized matrix is so expensive, 
reducing the number of iterations by a handful can save hours of computation. Some of the 
most valuable contributions have come from researchers at Stanford who have discovered 
several methods for accelerating the power method. There are really just two ways to 
reduce the work involved in any iterative method: either reduce the work per iteration or 
reduce the total number of iterations. These goals are often at odds with one another. That 
is, reducing the number of iterations usually comes at the expense of a slight increase in 
the work per iteration, and vice versa. As long as this overhead is minimal, the proposed 
acceleration is considered beneficial. In this chapter, we review three of the most successful 
methods for reducing the work associated with the PageRank vector. 


9.1 AN ADAPTIVE POWER METHOD 


The goal of the PageRank game is to compute $\pi^T$, the stationary vector of G, or technically,
the power iterates $\pi^{(k)T}$ such that $\|\pi^{(k)T} - \pi^{(k-1)T}\|_1 < \tau$, where $\tau$ is some
acceptable convergence criterion. Suppose, for the moment, that we magically know $\pi^T$
from the start. We’d, of course, be done, problem solved. But, out of curiosity, let’s run
the power method to see how far the iterates $\pi^{(k)T}$ are from the final answer $\pi^T$. We
want to know what kind of progress the power method is making throughout the iteration
history. There are several ways to do this. You can take a macroscopic view and look
at how far $\pi^{(k)T}$, the current iterate, is from $\pi^T$, the magical final answer, by computing
$\|\pi^{(k)T} - \pi^T\|_1$. By using the norm, the individual errors in each component are lumped
into a single scalar which gives the aggregated error. The standard power method takes the
macroscopic view at each iteration, using a convergence test that looks at an aggregated
error, $\|\pi^{(k)T} - \pi^{(k-1)T}\|_1$. Another idea is to take a microscopic view and look at individual
components in the two vectors, examining how far $\pi_i^{(k)}$ is from $\pi_i$ at each iteration. This is
exactly what Stanford researchers Sep Kamvar, Taher Haveliwala, Gene Golub, and Chris
Manning did [102].


Kamvar et al. [102] noticed that some pages converge to their PageRank values faster 
than other pages. However, the standard power method with its macroscopic view doesn’t 
notice this, and blindly charges on, making unnecessary calculations. In fact, the Stanford 
group found that most pages converge to their final PageRank values quickly. The power 
method is forced to drag on because a small proportion of obstinate pages take longer to 
settle down to their final PageRank values. As elements of the PageRank vector converge, 
the adaptive PageRank method “locks” them and does not compute them in subsequent 
iterations. But how do you know which elements to lock and when? In the case when
we magically know $\pi^T$, lock element $i$ when $|\pi_i^{(k)} - \pi_i| < \epsilon$, where
$\epsilon$ is the microscopic convergence tolerance. (Kamvar et al. used
$\epsilon = 10^{-3}$.) In practice, lock element $i$ when
$|\pi_i^{(k)} - \pi_i^{(k-1)}| < \epsilon$, i.e., the difference in successive iterates is small enough.
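The locking rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kamvar et al.'s implementation: the function names, the dense list-of-lists matrix, and the hypothetical 4-page web are all made up for the example, and the final renormalization is a convenience added here.

```python
def google_matrix(H, alpha, v):
    """Dense Google matrix for a tiny example; dangling rows of H are replaced by v."""
    n = len(H)
    G = []
    for row in H:
        base = row if sum(row) > 0 else v   # dangling-node fix
        G.append([alpha * base[j] + (1.0 - alpha) * v[j] for j in range(n)])
    return G


def adaptive_power_method(G, pi0, eps=1e-3, tol=1e-8, max_iter=1000):
    """Power method that locks component i once |pi_i^(k) - pi_i^(k-1)| < eps.

    Locked components are no longer recomputed. With eps = 0 nothing is ever
    locked, so the routine reduces to the standard power method.
    Returns (pi, iterations, locked flags).
    """
    n = len(G)
    pi = list(pi0)
    locked = [False] * n
    iterations = 0
    for k in range(1, max_iter + 1):
        iterations = k
        new = list(pi)
        for j in range(n):
            if not locked[j]:                      # skip the work for locked pages
                new[j] = sum(pi[i] * G[i][j] for i in range(n))
        change = [abs(new[j] - pi[j]) for j in range(n)]
        for j in range(n):
            if not locked[j] and change[j] < eps:  # microscopic test
                locked[j] = True
        pi = new
        if sum(change) < tol or all(locked):       # macroscopic test, or nothing left to do
            break
    total = sum(pi)                                # locking can disturb the sum slightly
    return [x / total for x in pi], iterations, locked


if __name__ == "__main__":
    # Hypothetical 4-page web: 1 -> 2,3; 2 -> 3; 3 -> 1; page 4 is dangling.
    H = [[0, 0.5, 0.5, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
    v = [0.25] * 4
    G = google_matrix(H, 0.85, v)
    standard, k_std, _ = adaptive_power_method(G, v, eps=0.0)
    adaptive, k_ada, locked = adaptive_power_method(G, v, eps=1e-3)
    print("standard:", [round(x, 4) for x in standard], "iterations:", k_std)
    print("adaptive:", [round(x, 4) for x in adaptive], "iterations:", k_ada)
```

On a web-sized problem the saving would come from skipping the locked columns of a sparse matrix; the dense loops here only show where that work is skipped.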

This adaptive power method provides a modest speedup in the computation of the 
PageRank vector, i.e., 17% on Kamvar et al.’s experimental datasets. However, while this 
algorithm was shown to converge in practice on a handful of datasets, there are serious open 
theoretical issues with the algorithm. For instance, there is no proof regarding convergence 
of the algorithm; the algorithm may or may not converge. And even if it does converge, 
the final answer may not be right. Because only short-run dynamics are considered in 
the locking decision, it’s not clear whether the algorithm converges to the true PageRank 
values or some gross approximation of them. In fact, nearly uncoupled chains are known 
to exhibit short-run stabilization in each cluster, which is then followed by a period of 
progress toward the global equilibrium. Further, the final global equilibrium often does not
resemble the short-run equilibria, meaning the adaptive method could stop too
soon with a grossly inaccurate answer for a nearly uncoupled chain. Nevertheless, the adaptive
algorithm makes a practical contribution to PageRank acceleration by attempting to reduce 
the work per iteration required by the power method. 


9.2 EXTRAPOLATION 


Another acceleration method proposed by the same group of Stanford researchers aims
to reduce the number of power iterations. The expected number of power iterations is
governed by the size of the subdominant eigenvalue $\lambda_2$. The idea of extrapolation goes
something like this: “if the subdominant eigenvalue causes the power method to sputter,
cut it out and throw it away.” To understand what this means, let’s look at the power iterates
using special spectral decomposition goggles. Spectral decomposition goggles are a bit
like x-ray vision in that they allow one to see deep into a matrix to examine its spectral
components. For simplicity, assume that $G$ is diagonalizable and
$1 > |\lambda_2| > \cdots > |\lambda_n|$. Then, the power iterates look like


$$
\begin{aligned}
\pi^{(k)T} &= \pi^{(k-1)T} G = \pi^{(0)T} G^k \\
           &= \pi^T + \lambda_2^k \gamma_2 y_2^T + \lambda_3^k \gamma_3 y_3^T + \cdots + \lambda_n^k \gamma_n y_n^T,
\end{aligned}
\tag{9.2.1}
$$

where $x_i$ and $y_i$ are the right-hand and left-hand eigenvectors of $G$ corresponding to
$\lambda_i$ and $\gamma_i = \pi^{(0)T} x_i$. It’s frustrating: at each iteration the desired
PageRank vector $\pi^T$ is sitting right there, taunting us. In fact, equation (9.2.1) shows
that $\lambda_2^k \gamma_2 y_2^T$ does the spoiling for the power method. $\pi^T$ is hidden
until $\lambda_2^k \to 0$, which takes a while when $|\lambda_2|$ is large. The technique of
extrapolation removes the spoiler. Notice that

$$
\pi^{(k)T} - \lambda_2^k \gamma_2 y_2^T = \pi^T + \lambda_3^k \gamma_3 y_3^T + \cdots + \lambda_n^k \gamma_n y_n^T,
$$

which is closer to the correct PageRank $\pi^T$ when $|\lambda_2| > |\lambda_3|$. This means
that if we could subtract $\lambda_2^k \gamma_2 y_2^T$ from the current iterate we could propel
the power method forward. However, the problem is how to compute
$\lambda_2^k \gamma_2 y_2^T$. When we take off the spectral decomposition goggles, the
spectral components are lumped together and we see only one vector, $\pi^{(k)T}$.
Fortunately, we can estimate $\lambda_2^k \gamma_2 y_2^T$ by using things we do have, or can
get: $\pi^{(k+2)T}$, $\pi^{(k+1)T}$, and $\pi^{(k)T}$. Kamvar et al. have shown that

$$
\lambda_2^k \gamma_2 y_2^T \approx
\frac{\left(\pi^{(k+1)T} - \pi^{(k)T}\right)^{.2}}{\pi^{(k+2)T} - 2\pi^{(k+1)T} + \pi^{(k)T}},
$$

where $(\star)^{.2}$ indicates component-wise squaring of elements in the vector $(\star)$,
and the division is also component-wise. Since extrapolation requires additional computation
(getting and storing the two subsequent iterates), it should only be applied periodically,
say every 10 iterations. Unfortunately, this method, which is referred to as Aitken
extrapolation because it is derived from the classic Aitken $\Delta^2$ method for accelerating
linearly convergent sequences, gives only modest speedups. One reason concerns the
eigenvalue $\lambda_3$. If $\lambda_2$ and $\lambda_3$ are complex conjugates, then
$|\lambda_2| = |\lambda_3|$ and Aitken extrapolation performs poorly.
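To see why the Aitken estimate works, it helps to try it on a sequence with exactly one spoiler term. The Python sketch below uses made-up numbers (the limit vector, the spoiler direction, and the value .85 are assumptions chosen for illustration); the function name `aitken_extrapolate` is likewise hypothetical. When the iterates have the form limit + $\lambda^k$ · direction, the component-wise formula removes the spoiler completely, up to rounding.

```python
def aitken_extrapolate(x_k, x_k1, x_k2):
    """Component-wise Aitken delta-squared step from three successive iterates."""
    result = []
    for a, b, c in zip(x_k, x_k1, x_k2):
        second_difference = c - 2.0 * b + a
        if second_difference == 0.0:
            result.append(c)          # nothing to extrapolate; keep the newest value
        else:
            spoiler = (b - a) ** 2 / second_difference   # estimate of the lambda_2 term
            result.append(a - spoiler)
    return result


if __name__ == "__main__":
    # Made-up data: a limit vector plus a single geometrically decaying "spoiler".
    limit = [0.2, 0.3, 0.5]
    direction = [0.10, -0.04, -0.06]
    lam = 0.85

    def iterate(k):
        return [p + lam ** k * d for p, d in zip(limit, direction)]

    x5, x6, x7 = iterate(5), iterate(6), iterate(7)
    print("plain iterate :", x7)
    print("extrapolated  :", aitken_extrapolate(x5, x6, x7))
```

With more than one decaying term present, as in (9.2.1), the same step only reduces the error rather than eliminating it.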


Kamvar et al. developed an improved extrapolation method, called quadratic extrapolation,
which, while more complicated, is based on the same idea as Aitken extrapolation.
That is, “if $\lambda_2$ and $\lambda_3$ cause you problems, cut them both out and throw them
away.” On the datasets tested, quadratic extrapolation speeds up PageRank computation
by 50-300% with minimal overhead. Figure 9.1 compares the residuals when the standard
power method and the power method with quadratic extrapolation are applied to a
small web graph. In this example, quadratic extrapolation is applied every 20 iterations.


[Figure 9.1 plot: $\log_{10}$ residual versus iteration; legend: * power method, + power method with quad. extrap.]

Figure 9.1 Residual plot for power method vs. power method with quadratic extrapolation

Notice how the iterate that results at each application of quadratic extrapolation makes
dramatic progress toward the solution. Unfortunately, quadratic extrapolation is expensive and
can be done only periodically. Researchers, such as extrapolation expert Claude Brezinski,
have recently begun experimenting with other classic extrapolation methods, such as
Chebyshev and $\epsilon$-extrapolation.


Matlab m-file for PageRank Power Method with Aitken extrapolation

This m-file implements the PageRank power method applied to the Google matrix
$G = \alpha S + (1-\alpha) e v^T$ with Aitken extrapolation applied every ‘l’ iterations.

function [pi,time,numiter]=aitkenPageRank(pi0,H,v,n,alpha,epsilon,l);

% AITKENPageRank computes the PageRank vector for an n-by-n Markov
%          matrix H with starting vector pi0 (a row vector),
%          scaling parameter alpha (scalar), and teleportation
%          vector v (a row vector). Uses power method with
%          Aitken extrapolation applied every l ("ell") iterations.
%
% EXAMPLE: [pi,time,numiter]=aitkenPageRank(pi0,H,v,900,.9,1e-8,10);
%
% INPUT:   pi0 = starting vector at iteration 0 (a row vector)
%          H = row-normalized hyperlink matrix (n-by-n sparse matrix)
%          v = teleportation vector (1-by-n row vector)
%          n = size of P matrix (scalar)
%          alpha = scaling parameter in PageRank model (scalar)
%          epsilon = convergence tolerance (scalar, e.g. 1e-8)
%          l ("ell") = Aitken extrapolation applied every l iterations (scalar)
%
% OUTPUT:  pi = PageRank vector
%          time = time required to compute PageRank vector
%          numiter = number of iterations until convergence
%
% The starting vector is usually set to the uniform vector,
% pi0=1/n*ones(1,n).
% NOTE: Matlab stores sparse matrices by columns, so it is faster
%       to do some operations on H', the transpose of H.

% get "a" vector, where a(i)=1, if row i is dangling node
% and 0, o.w.
rowsumvector=ones(1,n)*H';
nonzerorows=find(rowsumvector);
zerorows=setdiff(1:n,nonzerorows);
% count the dangling nodes under their own name so that the
% extrapolation period l ("ell") is not overwritten
numdangling=length(zerorows);
a=sparse(zerorows,ones(numdangling,1),ones(numdangling,1),n,1);

k=0;
residual=1;
pi=pi0;
tic;

while (residual >= epsilon)
  prevpi=pi;
  k=k+1;
  pi=alpha*pi*H + (alpha*(pi*a)+1-alpha)*v;
  residual=norm(pi-prevpi,1);
  if (mod(k,l))==0
    % 'Aitken extrapolation'
    nextpi=alpha*pi*H + (alpha*(pi*a)+1-alpha)*v;
    g=(pi-prevpi).^2;
    h=nextpi-2*pi+prevpi;
    nextpi=prevpi-(g./h);
    % accept the extrapolated vector only if no component of h was zero
    % (a zero there produces Inf or NaN entries)
    if all(isfinite(nextpi))
      pi=nextpi;
    end
    % 'end Aitken extrapolation'
  end
end
numiter=k;
time=toc;

Matlab m-file for PageRank Power Method with quadratic extrapolation


This m-file implements the PageRank power method applied to the Google matrix
$G = \alpha S + (1-\alpha) e v^T$ with quadratic extrapolation applied every ‘l’ iterations.

function [pi,time,numiter]=quadPageRank(pi0,H,v,n,alpha,epsilon,l);

% QUADPageRank computes the PageRank vector for an n-by-n Markov
%          matrix H with starting vector pi0 (a row vector),
%          scaling parameter alpha (scalar), and teleportation
%          vector v (a row vector). Uses power method with
%          quadratic extrapolation applied every l ("ell") iterations.
%
% EXAMPLE: [pi,time,numiter]=quadPageRank(pi0,H,v,900,.9,1e-8,10);
%
% INPUT:   pi0 = starting vector at iteration 0 (a row vector)
%          H = row-normalized hyperlink matrix (n-by-n sparse matrix)
%          v = teleportation vector (1-by-n row vector)
%          n = size of P matrix (scalar)
%          alpha = scaling parameter in PageRank model (scalar)
%          epsilon = convergence tolerance (scalar, e.g. 1e-8)
%          l ("ell") = quadratic extrapolation applied every l ("ell")
%                      iterations (scalar)
%
% OUTPUT:  pi = PageRank vector
%          time = time required to compute PageRank vector
%          numiter = number of iterations until convergence
%
% The starting vector is usually set to the uniform vector,
% pi0=1/n*ones(1,n).
% NOTE: Matlab stores sparse matrices by columns, so it is faster
%       to do some operations on H', the transpose of H.

% get "a" vector, where a(i)=1, if row i is dangling node
% and 0, o.w.
rowsumvector=ones(1,n)*H';
nonzerorows=find(rowsumvector);
zerorows=setdiff(1:n,nonzerorows);
% count the dangling nodes under their own name so that the
% extrapolation period l ("ell") is not overwritten
numdangling=length(zerorows);
a=sparse(zerorows,ones(numdangling,1),ones(numdangling,1),n,1);

k=0;
residual=1;
pi=pi0;
tic;

while (residual >= epsilon)
  prevpi=pi;
  k=k+1;
  pi=alpha*pi*H + (alpha*(pi*a)+1-alpha)*v;
  residual=norm(pi-prevpi,1);
  if (mod(k,l))==0
    % 'quadratic extrapolation'
    nextpi=alpha*pi*H + (alpha*(pi*a)+1-alpha)*v;
    nextnextpi=alpha*nextpi*H + (alpha*(nextpi*a)+1-alpha)*v;
    y=pi-prevpi; nexty=nextpi-prevpi; nextnexty=nextnextpi-prevpi;
    Y=[y' nexty'];
    gamma3=1;
    % do modified gram-schmidt QR instead of matlab's [Q,R]=qr(Y);
    % (the column count is called p so that the input n is not overwritten)
    [m, p] = size(Y);
    Q = zeros(m,p);
    R = zeros(p);
    for j=1:p
      R(j,j) = norm(Y(:,j));
      Q(:,j) = Y(:,j)/R(j,j);
      R(j,j+1:p) = Q(:,j)'*Y(:,j+1:p);
      Y(:,j+1:p) = Y(:,j+1:p) - Q(:,j)*R(j,j+1:p);
    end
    Qnextnexty=Q'*nextnexty';
    gamma2=-Qnextnexty(2)/R(2,2);
    gamma1=(-Qnextnexty(1)-gamma2*R(1,2))/R(1,1);
    gamma0=-(gamma1+gamma2+gamma3);
    beta0=gamma1+gamma2+gamma3;
    beta1=gamma2+gamma3;
    beta2=gamma3;
    nextnextpi=beta0*pi+beta1*nextpi+beta2*nextnextpi;
    nextnextpi=nextnextpi/sum(nextnextpi);
    pi=nextnextpi;
    % 'end quadratic extrapolation'
  end
  pi=pi/sum(pi);
end
numiter=k;
time=toc;


9.3 AGGREGATION 


The same group of Stanford researchers, Kamvar et al. [101], has produced one more
contribution to the acceleration of PageRank. This method works on both acceleration
goals simultaneously, trying to reduce both the number of iterations and the work per
iteration. This very promising method, called BlockRank, is an aggregation method that
lumps sections of the Web by hosts. BlockRank begins by taking the webgraph (where
nodes represent webpages) and compressing it into a hostgraph (where the nodes represent
hosts). Hosts are the high-level webpages like www.ncsu.edu, under which lots of other
pages sit. Most pages within a host intralink to other pages within the host, but a few links
are interhost links, meaning they link between hosts. In the global hostgraph, intralinks are
ignored. When the PageRank model is applied to the small hostgraph, a HostRank vector
is output. The HostRank for host $i$ gives the relative importance of that host. While the
HostRank problem is much smaller than the original PageRank problem, it doesn’t give us
what we want, which is the importance of individual pages, not individual hosts. In order
to get one global PageRank vector, we first compute many local PageRank vectors, one
for the pages in each individual host. Now only the intralinks are used, and
the interlinks are ignored. This is an easy computation since hosts generally have fewer than
a few thousand pages. Thus, the PageRank model is applied to each host, www.ncsu.edu,
www.msmary.edu, www.cofc.edu, and so on. At this point, there is one global $1 \times |H|$
HostRank vector, where $|H|$ is the number of hosts, as well as $|H|$ local PageRank vectors,
each $1 \times |H_i|$ in size, where $|H_i|$ is the number of pages in host $H_i$. To approximate the
global PageRank vector, simply multiply the local PageRank vector for host $H_i$ by the
probability of being in that host, given by the $i$-th element of the HostRank vector. This is
called the expansion step.


This method gives an approximation to the true PageRank vector that the power 
method computes. It’s an approximation because at each step some links are ignored, 
which means that valuable information is lost in the compression or so-called aggregation
step. Fortunately, this approximation can be improved if, in an accordion style, the
collapsing/expanding process is repeated until convergence. BlockRank is actually just
classic aggregation [51, 56, 92, 151, 155] applied to the PageRank problem. (See sections 
10.3-10.5 for more on aggregation.) This method often reduces both the number of itera- 
tions required and the work per iteration. It produced a speedup of a factor of 2 on some 
datasets used by Kamvar et al. More recent, but very related, algorithms [42, 116] use sim- 
ilar aggregation techniques to exploit the Web’s structure to speed ranking computations. 


EXAMPLE In order to understand the basic principles of aggregation used by the 
BlockRank algorithm, consider the nearly uncoupled chain of Example 2 from Chapter 
6. The 7-node graph is reproduced below in Figure 9.2. Clearly, nodes 1, 2, 3, and 7 can 
be considered as one host (called Host 1), due to their strong interaction. Nodes 4, 5, and 6 
then make up Host 2. The BlockRank algorithm aggregates the 7-node graph into a smaller 
2-node graph of hosts. The transition matrix associated with this host graph is 


The HostRank vector associated with the host graph, the stationary vector of the Google
matrix for the host graph, is $(.3676 \;\; .6324)$ (here we used $\alpha = .9$ and
$v^T = (.5 \;\; .5)$). This means that 36.76% of the time we expect the random surfer to
visit the states of Host 1, i.e., webpages 1, 2, 3, and 7.

Figure 9.2 Nearly uncoupled graph for web of seven pages


Next, local PageRank vectors are computed for each host. For Host 1, the hyperlink
matrix is

$$
H_1 = \begin{array}{c|cccc}
      & P_1 & P_2 & P_3 & P_7 \\ \hline
P_1   & 0   & 1   & 0   & 0   \\
P_2   & 0   & 0   & 1   & 0   \\
P_3   & 1/3 & 1/3 & 0   & 1/3 \\
P_7   & 0   & 0   & 0   & 0
\end{array}
$$

Only within-host links are used to create $H_1$; all interhost links are ignored, namely, the
link $3 \to 4$. With $\alpha = .9$ and $v^T = (.25 \;\; .25 \;\; .25 \;\; .25)$, the local PageRank vector
for Host 1 is $(.1671 \;\; .3175 \;\; .3483 \;\; .1671)$. The interpretation of the second element of
this vector is that, given the random surfer is in the states of Host 1, 31.75% of the time he
visits webpage 2. Similarly, the local hyperlink matrix for Host 2 is

$$
H_2 = \begin{array}{c|ccc}
      & P_4 & P_5 & P_6 \\ \hline
P_4   & 0   & 1   & 0   \\
P_5   & 0   & 0   & 1   \\
P_6   & 1   & 0   & 0
\end{array}
$$

And the local PageRank vector for Host 2 is $(1/3 \;\; 1/3 \;\; 1/3)$.


The final step is the disaggregation step, which uses these three small vectors to
create a $1 \times 7$ vector $\tilde{\pi}^T$ that approximates the exact PageRank vector $\pi^T$.
With the entries listed in the page order 1, 2, 3, 7, 4, 5, 6,

$$
\begin{aligned}
\tilde{\pi}^T &= \big(\, .3676\,(.1671 \;\; .3175 \;\; .3483 \;\; .1671) \;\big|\; .6324\,(1/3 \;\; 1/3 \;\; 1/3) \,\big) \\
              &= (\, .0614 \;\; .1167 \;\; .1280 \;\; .0614 \;\; .2108 \;\; .2108 \;\; .2108 \,).
\end{aligned}
$$

Compare this with the exact PageRank vector $\pi^T$ computed by the power method, listed in
the same page order:

$$
\pi^T = (\, .0538 \;\; .1022 \;\; .1132 \;\; .0538 \;\; .2271 \;\; .2256 \;\; .2242 \,).
$$

Classic aggregation methods are known to work well and reduce effort when computing
the stationary vector for a nearly uncoupled Markov chain. The web chain is somewhat
uncoupled so BlockRank works well so long as an appropriate level of host aggregation is
done.
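The arithmetic of the expansion step in this example is easy to reproduce. The short Python sketch below simply re-multiplies the numbers quoted above; the helper name `expand` and the dictionary layout are assumptions made for illustration and are not part of BlockRank itself.

```python
def expand(host_rank, local_vectors):
    """Expansion step: scale each host's local PageRank vector by that host's HostRank."""
    approx = {}
    for weight, (pages, local) in zip(host_rank, local_vectors):
        for page, value in zip(pages, local):
            approx[page] = weight * value
    return approx


# Numbers quoted in the example above.
host_rank = [0.3676, 0.6324]
local_vectors = [
    ([1, 2, 3, 7], [0.1671, 0.3175, 0.3483, 0.1671]),   # Host 1
    ([4, 5, 6], [1 / 3, 1 / 3, 1 / 3]),                 # Host 2
]
exact = {1: 0.0538, 2: 0.1022, 3: 0.1132, 7: 0.0538, 4: 0.2271, 5: 0.2256, 6: 0.2242}

approx = expand(host_rank, local_vectors)
l1_error = sum(abs(approx[p] - exact[p]) for p in exact)

if __name__ == "__main__":
    for page in sorted(approx):
        print(page, round(approx[page], 4), exact[page])
    print("1-norm of the difference:", round(l1_error, 4))
```

The 1-norm of the difference between the two vectors comes out near .089, which gives a feel for how much information a single aggregation and expansion pass loses on this small chain.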


9.4 OTHER NUMERICAL METHODS 


Yet another group of researchers from Stanford, joined by IBM scientists, dropped the 
restriction to the power method. In their short paper, Arasu et al. [10] provide one small 
experiment with the Gauss-Seidel method applied to the PageRank problem. Bianchini et 
al. [29] suggest using the Jacobi method to compute the PageRank vector. Golub and Greif 
also conduct some experiments with the Arnoldi method [81]. 


Another promising avenue for PageRank acceleration recently began receiving aca- 
demic attention: parallel processing. Daniel Szyld and his colleagues have conducted 
experiments that execute the PageRank power method in parallel with very little overhead 
communication between processors. Others have corroborated the benefits and particular 
challenges of parallel processing for PageRank computation [80, 118]. 


Despite this progress, these are just beginnings. If the holy grail of real-time per- 
sonalized search is ever to be realized, then drastic speed improvements must be made, 
perhaps by innovative new algorithms, or the simple combination of many of the current 
acceleration methods into one algorithm. 


ASIDE: Google API 


In April 2002, Google released its Web Application Programming Interface (API), 
which provides fans a free (for now) and legal way to access their search results with auto- 
mated queries. (Without the API, automated querying is against Google’s Terms of Service.) 
By doing this, Google let the world’s programmers virtually run free in Google labs. Google 
suddenly had thousands of free employees, some more productive and generous than others, 
creating new services and applications of Google and offering to give them back to the public. 
For example, four products from API programmers are available at
http://www.tele-pro.co.uk/scripts/google/. Developers are free to publish their results as long as
they are for noncommercial purposes. Software developers interested in the API download the 
free developer’s kit, create an account, and get a license key. The license key allows a devel- 
oper 1,000 queries a day (which explains why the API-generated application, RankPulse of 
the aside on page 65, tracks exactly 1,000 terms). With the key, developers are free to exper- 
iment with ways of accessing the standard Google index (which does not include the image, 
news, shopping, or other special-purpose indexes). For example, the book Google Hacks (see 
box on page 73) provides API code for adding to any webpage a small box of Google results 
for your chosen query that are refreshed daily. Other developers anticipate using the API to 
create applications that search both traditional library catalogs as well as the entire Web from 
a single command. Of course, with access to one of the world’s largest indexes, the API is an 
excellent way for web ranking researchers to test their new algorithms.

Chapter Ten
Updating the PageRank Vector 


Every month a famous dance takes place on the Web. While there have been famous 
dances throughout modern history—the Macarena, the Mambo #5, the Chicken Dance— 
this dance is the first to have a profound impact on the search community. Every month 
search engine optimizers (SEOs) watch the Google Dance carefully, anxious to see if any 
steps have changed. Sometimes the modifications are easy to roll with, other times they 
cause a stir. 


The Google Dance is the nickname given to Google’s monthly updating of its rank- 
ings. We begin with some statistics that emphasize the need for updating rankings fre- 
quently. A study by web researchers Junghoo Cho and Hector Garcia-Molina [52] in 2000 
reported that 40% of all webpages in their dataset changed within a week, and 23% of 
the .com pages changed daily. In a much more extensive and recent study, the results of 
Dennis Fetterly and his colleagues [74] concur. About 35% of all webpages changed over 
the course of their study, and also pages that were larger in size changed more often and 
more extensively than their smaller counterparts. In the above studies, change was defined 
as either a change in page content or a change in page outlinks or both. Now consider 
news webpages, where updates to both content and links might occur on an hourly basis. 
Both the content score, which incorporates page content, and the PageRank score, which 
incorporates the Web’s graph structure, must be updated frequently to stay fresh. Ideally, 
the ranking scores would be as dynamic as the Web. Currently, it is believed that Google 
updates its PageRank vector monthly and possibly its content scores more often [7]. Con- 
sequently, researchers have been working to make updating easier, taking advantage of old 
computations to speed updated computations, and thereby making more frequent updating 
possible. 


In this chapter we focus on the mathematical problem associated with the Google 
Dance, specifically the issue of updating the PageRank vector. The phrase “updating 
PageRank” refers to the process of computing the new PageRank vector after monthly 
changes have been made to the Web’s graph structure. Between updates, thousands of 
links are added and removed, and thousands of pages are added and removed. The simplest, 
most naive updating strategy starts from scratch, that is, it recomputes the new PageRank 
vector making no use of the previous PageRank vector. To our knowledge, the PageRank 
vector for Google’s entire index is recomputed each month from scratch or nearly from 
scratch. (Popular sites may have their PageRank updated more frequently.) That is, last 
month’s vector is not used to create this month’s vector. A Google spokesperson at the 
annual SIAM meeting in 2002 reported that restarting this month’s power method with last 
month’s vector seemed to provide no improvement. The goal of updating is to beat this 
naive method. Surely, all that effort spent last month to compute PageRank has some value
toward computing this month’s PageRank with less work.


The setup for the updating problem follows. Suppose that the PageRank vector
$\phi^T = (\phi_1, \phi_2, \ldots, \phi_m)$ for last month’s Google matrix $Q_{m \times m}$ is known
(by prior computation), but the web graph requires updating because some hyperlinks have
been altered or some webpages have been added or deleted. The updated Google matrix
$G_{n \times n}$ may have a different size than $Q$, i.e., $m \neq n$. The updating problem is to
compute the updated PageRank $\pi^T = (\pi_1, \pi_2, \ldots, \pi_n)$ for $G$ by somehow using the
components in $\phi^T$ to produce $\pi^T$ with less effort than that required by working blind
(i.e., by computing $\pi^T$ without knowledge of $\phi^T$).


10.1 THE TWO UPDATING PROBLEMS AND THEIR HISTORY 


One fact that makes updating PageRank so challenging is that there are really two types of 
updates that are possible. First, when hyperlinks are added to or removed from the Web 
(or their weights are changed), the elements of the hyperlink matrix H change but the size 
of the matrix does not. If these are the only type of updates allowed, then the problem 
is called a link-updating problem. However, webpages themselves may be added to or 
removed from the Web. With this page-updating problem, states are added to or removed 
from the Google Markov chain, and the size of the Google matrix changes. Of the two 
updating problems, the page-updating problem is more difficult, and it generally includes 
the link-updating problem as a special case. (In section 10.6, we present a general-purpose 
algorithm that simultaneously handles both kinds of updating problems.) 


Since Markov chains and their stationary vectors have been around for nearly a cen- 
tury, the updating problem is not new. Researchers have been studying the problem for 
decades. History has followed theory; the easier link-updating problem has been studied 
much more extensively than the tougher page-updating problem. In fact, several solutions 
for link-updating already exist. In 1980, a theoretical formula for exact link-updating was 
derived in [129]. Unfortunately, the formula restricts updates so that only a single row of 
link-updates can be made to the Markov transition matrix. Thus, more general updates 
must be handled with a sequential one-row-at-a-time procedure. The idea is similar to the 
well-known Sherman-Morrison formula [127, p. 124] for updating a solution to a nonsingular
linear system, but the techniques must be adapted to the singular matrix $A = I - Q$.
The mechanism for doing this is by means of the group inverse $A^{\#}$ for $A$, which is
the unique matrix satisfying the three equations: $AA^{\#}A = A$, $A^{\#}AA^{\#} = A^{\#}$, and
$AA^{\#} = A^{\#}A$. This matrix is often involved in questions concerning Markov chains;
see [46, 122, 127] for some general background and [46, 49, 50, 76, 83, 130, 122, 124,
121, 126, 129] for Markov chain applications.


The primary exact updating results from [129], as they apply to the PageRank prob- 
lem, are summarized below. 


Theorem 10.1.1. Let $Q$ be the transition probability matrix of a Google Markov matrix
and suppose that the $i$-th row $q^T$ of $Q$ is updated to produce $g^T = q^T - \delta^T$, the $i$-th
row of $G$, which is the Google matrix of an updated Markov chain. If $\phi^T$ and $\pi^T$ denote
the stationary probability distributions of $Q$ and $G$ respectively, and if $A = I - Q$, then
$\pi^T = \phi^T - \epsilon^T$, where

$$
\epsilon^T = \phi_i \left[ \frac{\delta^T A^{\#}}{1 + \delta^T A^{\#}_{*i}} \right]
\qquad (A^{\#}_{*i} = \text{the } i\text{-th column of } A^{\#}). \tag{10.1.1}
$$

To handle multiple row updates to $Q$, this formula must be sequentially applied one row at
a time, which means that the group inverse must be sequentially updated. The formula for
updating $(I - Q)^{\#}$ to $(I - G)^{\#}$ is as follows.

$$
(I - G)^{\#} = A^{\#} + e\epsilon^T \left[ A^{\#} - \gamma I \right] - \frac{A^{\#}_{*i}\,\epsilon^T}{\phi_i},
\qquad \text{where } \gamma = \frac{\epsilon^T A^{\#}_{*i}}{\phi_i} \tag{10.1.2}
$$

and $e$ is a column of ones.
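Formula (10.1.1) can be checked in a few lines. The sketch below assumes two standard group-inverse facts for an irreducible chain that are not stated in the text above, namely $A^{\#}A = I - e\phi^T$ and $A^{\#}e = 0$, and it writes the row update as $I - G = A + e_i\delta^T$, where $e_i$ denotes the $i$-th unit column (a symbol introduced here only for this check). Trying $\pi^T = \phi^T - c\,\delta^T A^{\#}$ with an unknown scalar $c$ gives:

```latex
\begin{align*}
(\phi^T - c\,\delta^T A^{\#})(A + e_i \delta^T)
  &= \phi^T A + \phi_i\,\delta^T - c\,\delta^T A^{\#} A - c\,(\delta^T A^{\#}_{*i})\,\delta^T \\
  &= \phi_i\,\delta^T - c\,\delta^T (I - e\phi^T) - c\,(\delta^T A^{\#}_{*i})\,\delta^T
     && (\phi^T A = 0) \\
  &= \bigl[\phi_i - c\,(1 + \delta^T A^{\#}_{*i})\bigr]\,\delta^T
     && (\delta^T e = 0).
\end{align*}
% The bracket vanishes exactly when c = \phi_i / (1 + \delta^T A^{\#}_{*i}),
% so \epsilon^T = c\,\delta^T A^{\#} reproduces (10.1.1).
% Also \epsilon^T e = c\,\delta^T A^{\#} e = 0, so \phi^T - \epsilon^T still sums to one.
```

Here $\delta^T e = 0$ holds because $q^T$ and $g^T$ are both rows of stochastic matrices, so each sums to one.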


While these results provide theoretical answers to the link-updating problem, they
are not computationally satisfying, especially if more than just one or two rows are involved.
If every row is changed, then the formulas require $O(n^3)$ floating point operations.


Other updating formulas exist [50, 77, 96, 104, 148], but all are variations of the
same rank-one updating idea involving a Sherman-Morrison [127, p. 124] type of formula,
and all are $O(n^3)$ algorithms for a general update. Moreover, all of these rank-one
updating techniques apply only to the simpler link-updating problem, and they are not 
easily adapted to handle the more complicated page-updating problem. Consequently, the 
conclusion is that while the known exact link-updating formulas might be useful when only 
a row or two is changed and no pages are added or deleted, they are not computationally 
practical for making more general updates, and thus, because of the dynamics of the Web, 
are virtually useless for updating PageRank. The survey of the available solutions for the 
page-updating problem is even bleaker. No theoretical or practical solutions for the page- 
updating problem for a Markov chain exist. In light of the dynamics of the Web, updating 
PageRank is quite an important and open challenge. 


10.2 RESTARTING THE POWER METHOD 


It appears then that starting from scratch is perhaps the only alternative for the PageRank
updating problem. Let’s begin our discussion with the simpler type of problem, the
link-updating problem. Therefore, assume $Q$ undergoes only link updates to create $G$.
Suppose that the power method is applied to the new, updated Google matrix $G$, but the old
PageRank vector $\phi^T$ is used as the starting vector for the iterative process (as opposed to
a random or uniform starting vector). Suppose that it is known that the updated stationary
distribution $\pi^T$ for $G$ is in some sense close to the original stationary distribution $\phi^T$ for
$Q$. For example, this might occur if the perturbations to $Q$ are small. It’s intuitive that if
$\phi^T$ and $\pi^T$ are close, then applying

$$
\pi^{(k+1)T} = \pi^{(k)T} G \quad \text{with} \quad \pi^{(0)T} = \phi^T \tag{10.2.1}
$$

should produce an accurate approximation to $\pi^T$ in fewer iterations than that required
when an arbitrary initial vector is used. To some extent this is true, but intuition generally
overestimates the impact, as explained below.


It’s well known that if $\lambda_2$ is the subdominant eigenvalue of $G$, and if $\lambda_2$ has index
one (linear elementary divisors), then the asymptotic rate of convergence [127, p. 621] of
(10.2.1) is

$$
R = -\log_{10} |\lambda_2|. \tag{10.2.2}
$$

For linear stationary iterative procedures the asymptotic rate of convergence $R$ is an
indication of the number of digits of accuracy that can be expected to be eventually gained
on each iteration, and this is independent of the initial vector. For example, suppose that
the entries of $G - Q$ are small enough to ensure that each component $\pi_i$ agrees with $\phi_i$ in
the first significant digit, and suppose that the goal is to compute the update $\pi^T$ to twelve
significant places by using (10.2.1). Since $\pi^{(0)T} = \phi^T$ already has one correct significant
digit, and since about $1/R$ iterations are required to gain each additional significant digit
of accuracy, (10.2.1) requires about $11/R$ iterations, whereas starting from scratch with an
initial vector containing no significant digits of accuracy requires about $12/R$ iterations. In
other words, the effort is reduced by about 8% for each correct significant digit that can be
built into $\pi^{(0)T}$. This dictates how much effort should be invested in determining a “good”
initial vector.


To appreciate what this means concerning the effectiveness of using (10.2.1) as an
updating technique, suppose, for example, that $|\lambda_2| = .85$ (as is common for PageRank),
and suppose that the perturbations resulting from updating $Q$ to $G$ are such that each
component $\pi_i$ agrees with $\phi_i$ in the first significant digit. If (10.2.1) is used to produce twelve
significant digits of accuracy, then it follows from (10.2.2) that about 156 iterations are
required. This is only about 14 fewer than the roughly 170 needed when starting blind with a random
initial vector. Consequently, restarting the power method with the old PageRank vector is
not an overly attractive approach to the link-updating problem even when changes are rel- 
atively small. Because the power method is not easily adapted to handle more complicated 
page-updating problems, it’s clear that, by itself, restarting the power method with the old 
PageRank vector is not a viable updating technique. 
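The iteration counts in this argument follow directly from (10.2.2) and can be reproduced with a few lines of Python. The function names below are hypothetical, and the counts are the asymptotic estimates used in the text, not measured iteration counts.

```python
import math


def asymptotic_rate(lambda2):
    """R = -log10|lambda_2|, the digits of accuracy gained per iteration (eventually)."""
    return -math.log10(abs(lambda2))


def iterations_for(digits_to_gain, lambda2):
    """Approximate iteration count needed to gain the requested number of digits."""
    return digits_to_gain / asymptotic_rate(lambda2)


if __name__ == "__main__":
    lam = 0.85
    warm = iterations_for(11, lam)    # old PageRank vector already has 1 correct digit
    cold = iterations_for(12, lam)    # starting vector with no correct digits
    print("R =", round(asymptotic_rate(lam), 4))
    print("restart from old vector:", round(warm), "iterations")
    print("start from scratch     :", round(cold), "iterations")
    print("relative saving        :", round(100 * (cold - warm) / cold, 1), "%")
```

The gap between the two counts is always $1/R$, whatever the number of digits requested, which is why a good starting vector buys so little.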


At this point, it seems that efficiently updating the stationary vector $\pi^T$ of a Markov
chain $G$ with knowledge of $Q$ and $\phi^T$ may be too lofty a goal. The only available method
for both link-updating and page-updating, restarting the power method with $\phi^T$, has little
benefit over starting completely from scratch, i.e., restarting the power method with a
random or uniform vector.


10.3 APPROXIMATE UPDATING USING APPROXIMATE AGGREGATION 


If, instead of aiming for the exact value of the updated stationary distribution, you are 
willing to settle for an approximation, then the door opens wider. For example, Steve Chien 
and his coworkers [48] estimate Google’s PageRank with an approximation approach that 
is based on state aggregation. State aggregation is part of a well-known class of methods 
known as approximate aggregation techniques [151] that have been used in the past to 
estimate stationary distributions of nearly uncoupled chains. The BlockRank algorithm of 
Chapter 9 used aggregation to accelerate the computation of PageRank. 


Even though it produces only estimates of $\pi^T$, approximate aggregation can handle
both link-updating as well as page-updating, and it is computationally cheap.


The underlying idea of approximate aggregation is to use the previously known distribution

$$
\phi^T = (\phi_1, \phi_2, \ldots, \phi_m)
$$

together with the updated transition probabilities in $G$ to build an aggregated Markov chain
having a transition probability matrix $\widetilde{C}$ that is smaller in size than $G$. The stationary
distribution $\tilde{\xi}^T$ of $\widetilde{C}$ is used to generate an estimate of the true updated
distribution $\pi^T$ as outlined below.


The state space $S$ of the updated Markov chain is first partitioned into two groups
as $S = L \cup \bar{L}$, where $L$ is the subset of states whose stationary probabilities are likely
to be most affected by the updates (newly added states are automatically included in $L$,
and deleted states are accounted for by changing affected transition probabilities to zero).
The complement $\bar{L}$ naturally contains all other states. The intuition is that the effect on the
stationary vector of perturbations involving only a few states in large sparse chains (such as 
those in Google’s PageRank application) is primarily local, and as a result, most stationary 
probabilities are not significantly affected. Deriving good methods for determining L is a 
pivotal issue, and this is discussed in more detail in section 10.7. 


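Partitioning the states of the updated chain as $\mathcal{S} = L \cup \overline{L}$ induces a partition (and
reordering) of the updated transition matrix and its respective stationary distribution

$$\mathbf{G}_{n \times n} = \bordermatrix{ & L & \overline{L} \cr L & \mathbf{G}_{11} & \mathbf{G}_{12} \cr \overline{L} & \mathbf{G}_{21} & \mathbf{G}_{22} } \quad\text{and}\quad \boldsymbol{\pi}^T = (\pi_1, \ldots, \pi_l \mid \pi_{l+1}, \ldots, \pi_n), \tag{10.3.1}$$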


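where $\mathbf{G}_{11}$ is $l \times l$ with $l = |L|$ being the cardinality of $L$, and $\mathbf{G}_{22}$ is $(n-l) \times (n-l)$.
The stationary probabilities from the original distribution $\boldsymbol{\phi}^T$ that correspond to the states
in $\overline{L}$ are placed in a row vector $\boldsymbol{\omega}^T$, and the states in $\overline{L}$ are lumped into one superstate to
create a smaller aggregated Markov chain whose transition matrix is the $(l+1) \times (l+1)$
matrix given by

$$\widetilde{\mathbf{C}} = \begin{pmatrix} \mathbf{G}_{11} & \mathbf{G}_{12}\mathbf{e} \\ \tilde{\mathbf{s}}^T\mathbf{G}_{21} & 1 - \tilde{\mathbf{s}}^T\mathbf{G}_{21}\mathbf{e} \end{pmatrix}, \quad\text{where}\quad \tilde{\mathbf{s}}^T = \frac{\boldsymbol{\omega}^T}{\boldsymbol{\omega}^T\mathbf{e}} \quad (\mathbf{e} \text{ is a column of ones}). \tag{10.3.2}$$
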
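The approximation procedure in [48] computes the stationary distribution

$$\tilde{\boldsymbol{\xi}}^T = (\tilde{\xi}_1, \tilde{\xi}_2, \ldots, \tilde{\xi}_l, \tilde{\xi}_{l+1})$$

for $\widetilde{\mathbf{C}}$ and uses the first $l$ components in $\tilde{\boldsymbol{\xi}}^T$ along with those in $\boldsymbol{\omega}^T$ to create an approximation
$\tilde{\boldsymbol{\pi}}^T$ to the exact updated distribution $\boldsymbol{\pi}^T$ by setting

$$\tilde{\boldsymbol{\pi}}^T = (\tilde{\xi}_1, \tilde{\xi}_2, \ldots, \tilde{\xi}_l \mid \boldsymbol{\omega}^T). \tag{10.3.3}$$

In other words,

$$\tilde{\pi}_i = \begin{cases} \tilde{\xi}_i, & \text{if state } i \text{ belongs to } L, \\ \phi_i, & \text{if state } i \text{ belongs to } \overline{L}. \end{cases}$$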


The theoretical justification for this approximation scheme along with its accuracy is
discussed in section 10.4. For now, it’s important to recognize the reduction in work that
is possible with approximate aggregation. Rather than finding the full updated PageRank
vector $\boldsymbol{\pi}^T$, a much smaller stationary vector $\tilde{\boldsymbol{\xi}}^T$ is used to build an approximation
$\tilde{\boldsymbol{\pi}}^T$ to $\boldsymbol{\pi}^T$.


It’s reported in [48] that numerical experiments on chains with millions of states
provide estimates such that

$$\|\boldsymbol{\pi}^T - \tilde{\boldsymbol{\pi}}^T\|_1 = O(10^{-5}).$$


However, it’s not clear that this is a good result because it is an absolute error, and absolute
errors can be deceptive indicators of accuracy in large chains. If a chain has millions of
states, and if, as is reasonable to expect, some stationary probability $\pi_i$ is on the order of
$10^{-5}$, then an approximation $\tilde{\pi}_i$ can be as much as 100% different from the exact $\pi_i$ in a
relative sense yet yield a deceptively small absolute difference. Making the complete case
should involve a relative measure.
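
The gap between the two error measures is easy to reproduce. The sketch below is only an illustration under assumed numbers: the chain size, the uniform stationary vector, and the ten perturbed entries are hypothetical and are not taken from the experiments in [48]. The estimate is left unnormalized because only the two error measures matter here.

```python
def absolute_l1_error(exact, approx):
    """1-norm of the error: sum of absolute componentwise differences."""
    return sum(abs(p - q) for p, q in zip(exact, approx))


def max_relative_error(exact, approx):
    """Largest componentwise relative error, max_i |p_i - q_i| / p_i."""
    return max(abs(p - q) / p for p, q in zip(exact, approx))


# Hypothetical chain: one million states, each with stationary probability 1e-6.
n = 1_000_000
exact = [1.0 / n] * n

# Hypothetical estimate: ten probabilities are doubled (100% off), the rest exact.
approx = list(exact)
for i in range(10):
    approx[i] = 2.0 * exact[i]

abs_err = absolute_l1_error(exact, approx)
rel_err = max_relative_error(exact, approx)
print(f"absolute 1-norm error: {abs_err:.1e}")
print(f"max relative error:    {rel_err:.0%}")
```

The absolute error looks tiny even though ten entries are off by a factor of two, which is exactly the situation a relative measure would expose.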


10.4 EXACT AGGREGATION 


The technique described in section 10.3 is simply one particular way to approximate the 
results of exact aggregation, which was developed in [125] and is briefly outlined below. 
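For an irreducible $n$-state Markov chain whose state space has been partitioned into $k$
disjoint groups $\mathcal{S} = L_1 \cup L_2 \cup \cdots \cup L_k$, the associated transition probability matrix
assumes the block-partitioned form

$$\mathbf{G}_{n \times n} = \bordermatrix{ & L_1 & L_2 & \cdots & L_k \cr L_1 & \mathbf{G}_{11} & \mathbf{G}_{12} & \cdots & \mathbf{G}_{1k} \cr L_2 & \mathbf{G}_{21} & \mathbf{G}_{22} & \cdots & \mathbf{G}_{2k} \cr \vdots & \vdots & \vdots & \ddots & \vdots \cr L_k & \mathbf{G}_{k1} & \mathbf{G}_{k2} & \cdots & \mathbf{G}_{kk} } \quad \text{(with square diagonal blocks)}. \tag{10.4.1}$$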


This parent Markov chain defined by $\mathbf{G}$ induces $k$ smaller Markov chains, called censored
chains, as follows. The censored Markov chain associated with a group of states $L_i$ is
defined to be the Markov process that records the location of the parent chain only when
the parent chain visits states in $L_i$. Visits to states outside of $L_i$ are ignored. The transition
probability matrix for the $i$-th censored chain is known to be the $i$-th stochastic complement
[125] given by the formula

$$\mathbf{S}_i = \mathbf{G}_{ii} + \mathbf{G}_{i\star}(\mathbf{I} - \mathbf{G}_i^{\star})^{-1}\mathbf{G}_{\star i}, \tag{10.4.2}$$

where $\mathbf{G}_{i\star}$ and $\mathbf{G}_{\star i}$ are, respectively, the $i$-th row and the $i$-th column of blocks with $\mathbf{G}_{ii}$
removed, and $\mathbf{G}_i^{\star}$ is the principal submatrix of $\mathbf{G}$ obtained by deleting the $i$-th row and $i$-th
column of blocks. For example, if the partition consists of just two groups $\mathcal{S} = L_1 \cup L_2$,
then there are only two censored chains, and their respective transition matrices are the two
stochastic complements

$$\mathbf{S}_1 = \mathbf{G}_{11} + \mathbf{G}_{12}(\mathbf{I} - \mathbf{G}_{22})^{-1}\mathbf{G}_{21} \quad\text{and}\quad \mathbf{S}_2 = \mathbf{G}_{22} + \mathbf{G}_{21}(\mathbf{I} - \mathbf{G}_{11})^{-1}\mathbf{G}_{12}.$$

If the stationary distribution for $\mathbf{G}$ is $\boldsymbol{\pi}^T = (\boldsymbol{\pi}_1^T \mid \boldsymbol{\pi}_2^T \mid \cdots \mid \boldsymbol{\pi}_k^T)$ (partitioned conformably
with $\mathbf{G}$), then the $i$-th censored distribution (the stationary distribution for $\mathbf{S}_i$) is known to
be equal to

$$\mathbf{s}_i^T = \frac{\boldsymbol{\pi}_i^T}{\boldsymbol{\pi}_i^T\mathbf{e}} \quad (\mathbf{e} \text{ is an appropriately sized column of ones}). \tag{10.4.3}$$

For regular chains [104], the $j$-th component of $\mathbf{s}_i^T$ is the limiting conditional probability
of being in the $j$-th state of group $L_i$ given that the process is somewhere in $L_i$.
To compress each group $L_i$ into a single state in order to create a small $k$-state
aggregated chain, squeeze the parent transition matrix $\mathbf{G}$ down to the aggregated transition
matrix (also known as the coupling matrix) by setting

$$\mathbf{C}_{k \times k} = \begin{pmatrix} \mathbf{s}_1^T\mathbf{G}_{11}\mathbf{e} & \cdots & \mathbf{s}_1^T\mathbf{G}_{1k}\mathbf{e} \\ \vdots & \ddots & \vdots \\ \mathbf{s}_k^T\mathbf{G}_{k1}\mathbf{e} & \cdots & \mathbf{s}_k^T\mathbf{G}_{kk}\mathbf{e} \end{pmatrix} \quad \text{(known to be stochastic and irreducible)}. \tag{10.4.4}$$

For regular chains, transitions between states in the aggregated chain defined by $\mathbf{C}$
correspond to transitions between groups $L_i$ in the unaggregated parent chain when the parent
chain is in equilibrium.


The remarkable feature surrounding this aggregation idea is that it allows a parent
chain to be decomposed into $k$ small censored chains that can be independently solved,
and the resulting censored distributions $\mathbf{s}_i^T$ can be combined through the stationary
distribution of $\mathbf{C}$ to construct the parent stationary distribution $\boldsymbol{\pi}^T$. This is the exact
aggregation theorem.


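Theorem 10.4.1. (The Exact Aggregation Theorem [125]). If $\mathbf{G}$ is the block-partitioned
transition probability matrix (10.4.1) for an irreducible $n$-state Markov chain whose
stationary probability distribution is

$$\boldsymbol{\pi}^T = (\boldsymbol{\pi}_1^T \mid \boldsymbol{\pi}_2^T \mid \cdots \mid \boldsymbol{\pi}_k^T) \quad \text{(partitioned conformably with } \mathbf{G}\text{)},$$

and if $\boldsymbol{\xi}^T = (\xi_1, \xi_2, \ldots, \xi_k)$ is the stationary distribution for the aggregated chain defined
by the matrix $\mathbf{C}_{k \times k}$ in (10.4.4), then the stationary distribution for $\mathbf{G}$ is

$$\boldsymbol{\pi}^T = (\xi_1\mathbf{s}_1^T \mid \xi_2\mathbf{s}_2^T \mid \cdots \mid \xi_k\mathbf{s}_k^T),$$

where $\mathbf{s}_i^T$ is the censored distribution associated with the stochastic complement $\mathbf{S}_i$ in
(10.4.2).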
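
The theorem can be checked numerically on a toy chain. The following is a minimal sketch, assuming a hypothetical 3-state regular chain split into the groups {state 0} and {states 1, 2}; the matrix entries and the helper names `vec_mat` and `stationary` are ours, chosen for illustration. Because the first group is a single state, its stochastic complement is the scalar 1, and the second complement needs only a scalar inverse.

```python
def vec_mat(v, M):
    """Row vector v times matrix M (plain lists of floats)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def stationary(P, tol=1e-14, max_iter=100_000):
    """Stationary distribution of a small regular chain by power iteration."""
    v = [1.0 / len(P)] * len(P)
    for _ in range(max_iter):
        w = vec_mat(v, P)
        if sum(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v


# Hypothetical parent chain (every row sums to 1).
G = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

g11 = G[0][0]                       # 1 x 1 block for group 1
G12 = G[0][1:]                      # 1 x 2 block
G21 = [G[1][0], G[2][0]]            # 2 x 1 block (stored as a list)
G22 = [row[1:] for row in G[1:]]    # 2 x 2 block

# Stochastic complement of group 2: S2 = G22 + G21 (1 - g11)^{-1} G12.
S2 = [[G22[i][j] + G21[i] * G12[j] / (1.0 - g11) for j in range(2)]
      for i in range(2)]
s1 = [1.0]                          # censored distribution of the 1 x 1 complement
s2 = stationary(S2)                 # censored distribution of group 2

# Coupling matrix (10.4.4): entry (i, j) is s_i^T G_ij e.
C = [[s1[0] * g11, s1[0] * sum(G12)],
     [sum(s2[i] * G21[i] for i in range(2)),
      sum(s2[i] * sum(G22[i]) for i in range(2))]]
xi = stationary(C)

# Exact aggregation: pi^T = (xi_1 s_1^T | xi_2 s_2^T).
pi_aggregated = [xi[0] * s1[0]] + [xi[1] * p for p in s2]
pi_direct = stationary(G)
gap = max(abs(a - b) for a, b in zip(pi_aggregated, pi_direct))
print(pi_aggregated, pi_direct, gap)
```

Up to the tolerance of the power iterations, the two vectors agree, as the theorem predicts.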


10.5 EXACT VS. APPROXIMATE AGGREGATION 


While exact aggregation as presented in Theorem 10.4.1 is elegant and appealing with its
divide and conquer philosophy, it’s an inefficient numerical procedure for computing $\boldsymbol{\pi}^T$
because costly inversions are embedded in the stochastic complements (10.4.2) that are
required to produce the censored distributions $\mathbf{s}_i^T$. Consequently, it’s common to attempt
to somehow approximate the censored distributions, and there are at least two methods for
doing so. Sometimes the stochastic complements $\mathbf{S}_i$ are first estimated (e.g., approximating
$\mathbf{S}_i$ with $\mathbf{G}_{ii}$ works well for nearly uncoupled chains). Then the distributions of these
estimates are computed to provide approximate censored distributions, which in turn lead
to an approximate aggregated transition matrix that is used by the exact aggregation
theorem to produce an approximation to $\boldsymbol{\pi}^T$. The other approach is to bypass the stochastic
complements altogether and somehow estimate the censored distributions $\mathbf{s}_i^T$ directly, and
this is the essence of the PageRank approximation scheme that was described in section
10.3.


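To see this, consider the updated transition matrix $\mathbf{G}$ given in (10.3.1) to be
partitioned into $l + 1$ levels in which the first $l$ diagonal blocks are just $1 \times 1$, and the lower
right-hand block is the $(n - l) \times (n - l)$ matrix $\mathbf{G}_{22}$ associated with the states in $\overline{L}$. In
other words, to fit the context of Theorem 10.4.1, the partition in (10.3.1) is viewed as

$$\mathbf{G} = \bordermatrix{ & L & \overline{L} \cr L & \mathbf{G}_{11} & \mathbf{G}_{12} \cr \overline{L} & \mathbf{G}_{21} & \mathbf{G}_{22} } = \begin{pmatrix} g_{11} & \cdots & g_{1l} & \mathbf{G}_{1\star} \\ \vdots & \ddots & \vdots & \vdots \\ g_{l1} & \cdots & g_{ll} & \mathbf{G}_{l\star} \\ \mathbf{G}_{\star 1} & \cdots & \mathbf{G}_{\star l} & \mathbf{G}_{22} \end{pmatrix}, \tag{10.5.1}$$

where

$$\mathbf{G}_{11} = \begin{pmatrix} g_{11} & \cdots & g_{1l} \\ \vdots & \ddots & \vdots \\ g_{l1} & \cdots & g_{ll} \end{pmatrix}, \quad \mathbf{G}_{12} = \begin{pmatrix} \mathbf{G}_{1\star} \\ \vdots \\ \mathbf{G}_{l\star} \end{pmatrix}, \quad\text{and}\quad \mathbf{G}_{21} = (\mathbf{G}_{\star 1} \cdots \mathbf{G}_{\star l}).$$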


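Since the first $l$ diagonal blocks in the partition (10.5.1) are $1 \times 1$ (i.e., scalars), it’s
evident that the corresponding stochastic complements are $\mathbf{S}_i = 1$ (they are $1 \times 1$ stochastic
matrices), so the censored distributions are $\mathbf{s}_i^T = 1$ for $i = 1, \ldots, l$. This means that the
exact aggregated transition matrix (10.4.4) associated with the partition (10.5.1) is

$$\mathbf{C} = \begin{pmatrix} g_{11} & \cdots & g_{1l} & \mathbf{G}_{1\star}\mathbf{e} \\ \vdots & \ddots & \vdots & \vdots \\ g_{l1} & \cdots & g_{ll} & \mathbf{G}_{l\star}\mathbf{e} \\ \mathbf{s}^T\mathbf{G}_{\star 1} & \cdots & \mathbf{s}^T\mathbf{G}_{\star l} & \mathbf{s}^T\mathbf{G}_{22}\mathbf{e} \end{pmatrix}_{(l+1) \times (l+1)} = \begin{pmatrix} \mathbf{G}_{11} & \mathbf{G}_{12}\mathbf{e} \\ \mathbf{s}^T\mathbf{G}_{21} & \mathbf{s}^T\mathbf{G}_{22}\mathbf{e} \end{pmatrix} = \begin{pmatrix} \mathbf{G}_{11} & \mathbf{G}_{12}\mathbf{e} \\ \mathbf{s}^T\mathbf{G}_{21} & 1 - \mathbf{s}^T\mathbf{G}_{21}\mathbf{e} \end{pmatrix}, \tag{10.5.2}$$

where $\mathbf{s}^T$ is the censored distribution derived from the only significant stochastic
complement

$$\mathbf{S} = \mathbf{G}_{22} + \mathbf{G}_{21}(\mathbf{I} - \mathbf{G}_{11})^{-1}\mathbf{G}_{12}.$$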


Compare the exact coupling matrix $\mathbf{C}$ above with the approximate $\widetilde{\mathbf{C}}$ suggested by Chien
et al. in equation (10.3.2). If the stationary distribution for $\mathbf{C}$ is

$$\boldsymbol{\xi}^T = (\xi_1, \ldots, \xi_l, \xi_{l+1}),$$

then exact aggregation (Theorem 10.4.1) ensures that the exact stationary distribution for
$\mathbf{G}$ is

$$\boldsymbol{\pi}^T = (\xi_1, \ldots, \xi_l \mid \xi_{l+1}\mathbf{s}^T) = (\pi_1, \ldots, \pi_l \mid \boldsymbol{\pi}_2^T). \tag{10.5.3}$$

It’s a fundamental issue to describe just how well the estimate $\tilde{\boldsymbol{\pi}}^T$ given in equation
(10.3.3) approximates the exact distribution $\boldsymbol{\pi}^T$ given in (10.5.3). Obviously, the degree
to which $\tilde{\pi}_i \approx \pi_i$ for $i > l$ (i.e., the degree to which $\boldsymbol{\omega}^T \approx \boldsymbol{\pi}_2^T$) depends on the degree
to which the partition $\mathcal{S} = L \cup \overline{L}$ can be adequately constructed. While it’s somewhat
intuitive that this should also affect the degree to which $\tilde{\pi}_i$ approximates $\pi_i$ for $i \le l$, it’s
not clear, at least on the surface, just how good this latter approximation is expected to be.
The analysis is as follows.
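Instead of using the exact censored distribution $\mathbf{s}^T$ to build the exact aggregated
matrix $\mathbf{C}$ in (10.5.2), the vector $\tilde{\mathbf{s}}^T = \boldsymbol{\omega}^T / (\boldsymbol{\omega}^T\mathbf{e})$ is used to approximate $\mathbf{s}^T$ in order to
construct $\widetilde{\mathbf{C}}$ in (10.3.2). The magnitude of

$$\boldsymbol{\delta}^T = \mathbf{s}^T - \tilde{\mathbf{s}}^T = \mathbf{s}^T - \frac{\boldsymbol{\omega}^T}{\boldsymbol{\omega}^T\mathbf{e}}$$

and the magnitude of

$$\mathbf{E} = \mathbf{C} - \widetilde{\mathbf{C}} = \begin{pmatrix} \mathbf{0} & \mathbf{0} \\ \boldsymbol{\delta}^T\mathbf{G}_{21} & -\boldsymbol{\delta}^T\mathbf{G}_{21}\mathbf{e} \end{pmatrix} = \begin{pmatrix} \mathbf{0} \\ \boldsymbol{\delta}^T \end{pmatrix} \mathbf{G}_{21} \begin{pmatrix} \mathbf{I} \mid -\mathbf{e} \end{pmatrix} \tag{10.5.4}$$

are clearly of the same order. This suggests that if the partition $\mathcal{S} = L \cup \overline{L}$ can be
adequately constructed so as to ensure that the magnitude of $\boldsymbol{\delta}^T$ is small, then $\widetilde{\mathbf{C}}$ is close to
$\mathbf{C}$, so their respective stationary distributions $\tilde{\boldsymbol{\xi}}^T$ and $\boldsymbol{\xi}^T$ should be close, thus ensuring
that $\tilde{\pi}_i$ and $\pi_i$ are close for $i \le l$. However, some care must be exercised before jumping
to this conclusion because Markov chains can sometimes exhibit sensitivities to small
perturbations.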


The effects of perturbations in Markov chains are well documented, and there are
a variety of ways to measure the degree to which the stationary probabilities are sensitive
to changes in the transition probabilities. These measures include the extent to which the
magnitude of the subdominant eigenvalue of the transition matrix is close to one [126, 128],
the degree to which various “condition numbers” are small [50, 76, 83, 98, 123], and the
degree to which the mean first passage times are small [49]. Any of these measures can be
used to produce a detailed perturbation analysis that revolves around the perturbation term
in (10.5.4), but, for the purposes at hand, it’s sufficient to note that it’s certainly possible for
$\tilde{\xi}_i$ and $\xi_i$ (and hence $\tilde{\pi}_i$ and $\pi_i$) to be relatively far apart for $i \le l$ even when $\boldsymbol{\delta}^T$ (and hence
$\mathbf{E}$) has small components. For example, this badly conditioned behavior can occur if the
magnitude of $\mathbf{G}_{12}$ is small because this ensures that the subdominant eigenvalue of $\mathbf{C}$ is
close to 1 (and some mean first passage times are large), and this is known [49, 126, 128]
to make the stationary probabilities sensitive to perturbations. Other aberrations in $\mathbf{C}$ can
also cause similar problems. Of course, if the chain defined by $\mathbf{C}$ is well conditioned
by any of the measures referenced above, then $\boldsymbol{\xi}^T$ will be relatively insensitive to small
perturbations, and the degree to which $\boldsymbol{\omega}^T \approx \boldsymbol{\pi}_2^T$ (i.e., the degree to which $\mathcal{S} = L \cup \overline{L}$
can be adequately constructed) will more directly reflect the degree to which $\tilde{\pi}_i \approx \pi_i$ for
$i \le l$. The point being made here is that unless the degree to which $\mathbf{C}$ is well conditioned
is established, the degree of the approximation in (10.3.3) is in doubt regardless of how
well $\boldsymbol{\omega}^T$ approximates $\boldsymbol{\pi}_2^T$.


This may seem to be a criticism of the idea behind the approximation (10.3.3), but, to 
the contrary, the purpose of this chapter is to argue that this is in fact a good idea because it 
can be viewed as the first step in an iterative aggregation scheme that performs remarkably 
well. The following section is dedicated to developing an iterative aggregation approach 
to updating stationary probabilities. 


10.6 UPDATING WITH ITERATIVE AGGREGATION 


Iterative aggregation is an algorithm for solving nearly uncoupled (sometimes called nearly 
completely decomposable) Markov chains, and it is discussed in detail in [154]. Iterative 
aggregation is not a general-purpose technique, and it usually doesn’t work for chains that 
are not nearly uncoupled. However, the ideas can be adapted to the updating problem, and
these variations work extremely well, even when applied to Markov chains that are not 
nearly uncoupled. This is in part due to the fact that the approximate aggregation matrix 
(10.3.2) differs from the exact aggregation matrix (10.5.2) in only one row, namely, the last 
row. The iterative aggregation updating algorithm is described below. 


Assume that the stationary distribution 


$$\boldsymbol{\phi}^T = (\phi_1, \phi_2, \ldots, \phi_m)$$


for some irreducible Markov chain C is already known, perhaps from prior computations, 
and suppose that C needs to be updated. As in earlier sections, let the transition probability 
matrix and stationary distribution for the updated chain be denoted by G and 


$$\boldsymbol{\pi}^T = (\pi_1, \pi_2, \ldots, \pi_n),$$


respectively. The updated matrix G is assumed to be irreducible. Of course, the specific 
application we have in mind is Google’s PageRank (in which case the matrix is guaranteed 
to be irreducible), but this method can be used to update other general irreducible Markov 
chains. Notice that m is not necessarily equal to n because the updating process may add 
or delete states as well as alter transition probabilities. 


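THE ITERATIVE AGGREGATION UPDATING ALGORITHM


Initialization

• Partition the states of the updated chain as $\mathcal{S} = L \cup \overline{L}$ and reorder $\mathbf{G}$ as described
in (10.3.1)

• $\boldsymbol{\omega}^T \leftarrow$ the components from $\boldsymbol{\phi}^T$ that correspond to the states in $\overline{L}$

• $\mathbf{s}^T \leftarrow \boldsymbol{\omega}^T / (\boldsymbol{\omega}^T\mathbf{e})$ (an initial approximate censored distribution)


Iterate until convergence

1. $\mathbf{C} \leftarrow \begin{pmatrix} \mathbf{G}_{11} & \mathbf{G}_{12}\mathbf{e} \\ \mathbf{s}^T\mathbf{G}_{21} & 1 - \mathbf{s}^T\mathbf{G}_{21}\mathbf{e} \end{pmatrix}_{(l+1) \times (l+1)}$ (where $l = |L|$)

2. $\boldsymbol{\xi}^T \leftarrow (\xi_1, \xi_2, \ldots, \xi_l, \xi_{l+1})$, the stationary distribution of $\mathbf{C}$

3. $\boldsymbol{\chi}^T \leftarrow (\xi_1, \xi_2, \ldots, \xi_l \mid \xi_{l+1}\mathbf{s}^T)$

4. $\boldsymbol{\psi}^T \leftarrow \boldsymbol{\chi}^T\mathbf{G} = (\boldsymbol{\psi}_1^T \mid \boldsymbol{\psi}_2^T)$ (see note following the algorithm)

5. If $\|\boldsymbol{\psi}^T - \boldsymbol{\chi}^T\| < \tau$ for a given tolerance $\tau$, then quit; otherwise set
$\mathbf{s}^T \leftarrow \boldsymbol{\psi}_2^T / (\boldsymbol{\psi}_2^T\mathbf{e})$ and go to step 1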
go to step 1 


Note concerning step 4. Step 4 is necessary because the vector $\boldsymbol{\chi}^T$ generated in step 3 is
a fixed point in the sense that if step 4 is omitted and the algorithm is restarted with $\boldsymbol{\chi}^T$
instead of $\boldsymbol{\psi}^T$, then the same $\boldsymbol{\chi}^T$ is simply reproduced at step 3 on each subsequent
iteration. Step 4 has two purposes: it moves the iterate off the fixed point while simultaneously
contributing to the convergence process. Step 4 is the analog of the smoothing operation in
algebraic multigrid algorithms, and it can be replaced by a step from almost any iterative
procedure used to solve linear systems; e.g., a Gauss-Seidel step [154] is sometimes used.
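
The five steps translate almost line for line into code. What follows is a minimal sketch rather than a production implementation: it assumes a small dense chain stored as nested lists and already reordered with the states of L first, it uses the 1-norm in step 5, and it finds the stationary distribution of the small aggregated matrix with a lazy power iteration (any small dense solver could be substituted). The function names, the 4-state matrix, and the prior probabilities in `omega` are hypothetical.

```python
def vec_mat(v, M):
    """Row vector v times matrix M (plain lists of floats)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def stationary(P, tol=1e-14, max_iter=100_000):
    """Stationary distribution of a small irreducible stochastic matrix P.

    Iterates with the lazy matrix (I + P)/2, which has the same stationary
    distribution as P but is always aperiodic.
    """
    v = [1.0 / len(P)] * len(P)
    for _ in range(max_iter):
        w = [0.5 * (a + b) for a, b in zip(v, vec_mat(v, P))]
        if sum(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v


def iterative_aggregation_update(G, l, omega, tol=1e-10, max_iter=1000):
    """Steps 1-5 of the algorithm; the first l states of G form the set L."""
    n = len(G)
    s = [w / sum(omega) for w in omega]      # initial censored estimate
    psi = None
    for _ in range(max_iter):
        # Step 1: build the (l+1) x (l+1) aggregated matrix C.
        C = [G[i][:l] + [sum(G[i][l:])] for i in range(l)]
        last = [sum(s[k] * G[l + k][j] for k in range(n - l)) for j in range(l)]
        C.append(last + [1.0 - sum(last)])
        # Step 2: stationary distribution of C.
        xi = stationary(C)
        # Step 3: disaggregate.
        chi = xi[:l] + [xi[l] * sk for sk in s]
        # Step 4: one global power step.
        psi = vec_mat(chi, G)
        # Step 5: test convergence in the 1-norm, else refresh s.
        if sum(abs(a - b) for a, b in zip(psi, chi)) < tol:
            break
        tail = psi[l:]
        s = [t / sum(tail) for t in tail]
    return psi


# Hypothetical updated chain with L = {0, 1}; omega holds the old
# probabilities of states 2 and 3 (made-up values).
G = [[0.10, 0.40, 0.30, 0.20],
     [0.30, 0.10, 0.20, 0.40],
     [0.25, 0.25, 0.25, 0.25],
     [0.20, 0.30, 0.40, 0.10]]
pi = iterative_aggregation_update(G, l=2, omega=[0.3, 0.2])
residual = max(abs(a - b) for a, b in zip(pi, vec_mat(pi, G)))
print(pi, residual)
```

For a web-sized problem the dense lists would be replaced by sparse storage, and the small system in step 2 would typically be handled by a direct solver.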
While precise rates of convergence for general iterative aggregation algorithms are
difficult to articulate, the specialized nature of our iterative aggregation updating algorithm
allows us to easily establish its rate of convergence. The following theorem shows that
this rate is directly dependent on how fast the powers of the one significant stochastic
complement $\mathbf{S} = \mathbf{G}_{22} + \mathbf{G}_{21}(\mathbf{I} - \mathbf{G}_{11})^{-1}\mathbf{G}_{12}$ converge. In other words, since $\mathbf{S}$ is an
irreducible stochastic matrix, the rate of convergence is completely dictated by the largest
subdominant eigenvalue (and Jordan structure) of $\mathbf{S}$.


Theorem 10.6.1. (Convergence Theorem for the Iterative Aggregation Updating
Algorithm [111]). The iterative aggregation updating algorithm defined above converges to the
stationary distribution $\boldsymbol{\pi}^T$ of $\mathbf{G}$ for all partitions $\mathcal{S} = L \cup \overline{L}$. The rate at which the iterates
converge to $\boldsymbol{\pi}^T$ is exactly the rate at which the powers $\mathbf{S}^n$ converge, which is dictated by
the largest subdominant eigenvalue $\lambda_2$ (and Jordan structure) of $\mathbf{S}$. In the common case
when $\lambda_2$ is real and simple, the iterates converge to $\boldsymbol{\pi}^T$ at the rate at which $\lambda_2^n \to 0$.


Further, Ilse Ipsen and Steve Kirkland have proven that, under a few assumptions
(that are easily satisfied for the PageRank case), the asymptotic convergence factor of this
iterative aggregation updating algorithm is never larger than that of the standard power
method [97]; in other words, it converges at least as fast per iteration.


10.7 DETERMINING THE PARTITION 


The iterative aggregation updating algorithm always converges, and it never requires more
iterations than the power method to attain a given level of convergence. However, iterative
aggregation requires more work per iteration than the power method. The key to realizing
an improvement in iterative aggregation over the power method rests in properly choosing
the partition $\mathcal{S} = L \cup \overline{L}$. As Theorem 10.6.1 shows, good partitions are precisely those
that yield a stochastic complement $\mathbf{S} = \mathbf{G}_{22} + \mathbf{G}_{21}(\mathbf{I} - \mathbf{G}_{11})^{-1}\mathbf{G}_{12}$ whose subdominant
eigenvalue $\lambda_2$ is small in magnitude.


While it’s not a theorem, experience indicates that as $|L| = l$ (the size of $\mathbf{G}_{11}$)
becomes larger, iterative aggregation tends to converge in fewer iterations. But as $l$ becomes
larger, each iteration requires more work, so the trick is to strike an acceptable balance. A
small $l$ that significantly reduces $|\lambda_2|$ is the ideal situation.


Even for moderately sized problems there is an extremely large number of possible 
partitions, but there are some useful heuristics that can help guide the choice of L so that 
reasonably good results are produced. For example, a relatively simple approach is to take 
L to be the set of all states “near” the updates, where “near” might be measured in a graph 
theoretic sense or else by the magnitude of transient flow. In the absence of any other 
information, this is not a completely bad strategy, and it is at least a good place to start. 
However, there are usually additional options that lead to even better “L-sets,” and some
of these are described below.


10.7.1 Partitioning by Differing Time Scales 


In most applications involving irreducible aperiodic Markov chains the components of the
$n$-th step distribution vector do not converge at a uniform rate, and consequently iterative
techniques, including the power method, often spend the majority of the time resolving a
minority of slow converging components. The slow converging components can be isolated
either by monitoring the process for a few iterations or by theoretical means such as those
described in [111]. (Section 9.1 already introduced the idea and detection of slow vs. fast
converging states for the PageRank problem.) If the states corresponding to the slower
converging components are placed in $L$ while the faster converging states are lumped into
$\overline{L}$, then the iterative aggregation algorithm concentrates its effort on resolving the smaller
number of slow converging states.


In loose terms, the effect of steps 1-3 in the iterative aggregation algorithm is
essentially to make progress toward achieving an equilibrium (or steady state) for a smaller
chain consisting of just the “slow states” in $L$ together with one additional aggregated
state that accounts for all “fast states” in $\overline{L}$. The power iteration in step 4 moves the entire
process ahead on a global basis, so if the slow states in $L$ are substantially resolved by
steps 1-3, then not many global power steps are required to push the entire chain toward
its global equilibrium. This is the essence of the original Simon-Ando idea as explained
and analyzed in [151] and [125]. If $l = |L|$ is small relative to $n$, then steps 1-3 are
relatively cheap to execute, so the process can converge rapidly (in both iteration count and
wall-clock time). Examples are given in [111].


In some applications the slow states are particularly easy to identify because they are 
the ones having the larger stationary probabilities. This is a particularly nice state of affairs 
for the updating problem because we have the stationary probabilities from the prior period 
at our disposal, so all we have to do to construct a good L-set is to include the states with 
large prior stationary probabilities and throw in the states that were added or updated along 
with a few of their nearest neighbors. Clearly, this is an advantage only when there are just 
a few “large” states. Fortunately, it turns out that this is a characteristic feature of Google’s 
PageRank application and other scale-free networks with power law distributions. 


10.7.2 Scale-Free Networks and Google’s PageRank 


As discussed in [16, 17, 41, 63], the link structure of the Web constitutes a “scale-free”
network. This means that the number of nodes $n(j)$ having $j$ edges (possibly directed)
is proportional to $j^{-k}$ where $k$ is a constant that doesn’t change as the network expands
(hence the term “scale-free”). In other words, the distribution of nodal degrees seems to
follow a “power law distribution” in the sense that

$$\text{Prob}[\deg(N) = d] \propto \frac{1}{d^k} \quad \text{for some } k > 1.$$

(The symbol $\propto$ is read “is proportional to.”) For example, studies [16, 17, 41, 63] have
shown that for the Web the parameter for the indegree power-law distribution is $k \approx 2.1$,
while the outdegree distribution has $k \approx 2.7$.
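
To get a feel for what such an exponent implies, here is a short worked calculation. It is our own illustration rather than a figure from the cited studies, and it assumes the power law holds exactly with an indegree exponent of 2.1. Comparing pages with indegree 10 to pages with indegree 100 gives

$$\frac{\text{Prob}[\deg(N) = 10]}{\text{Prob}[\deg(N) = 100]} = \left(\frac{100}{10}\right)^{2.1} = 10^{2.1} \approx 126,$$

so under this assumption a tenfold increase in indegree makes a page roughly 126 times rarer.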


The scale-free nature of the Web translates into a power law for PageRanks. In fact,
experiments described in [63, 136] indicate that PageRank has a power law distribution
with a parameter $k \approx 2.1$. In other words, there are relatively very few pages that have a
significant PageRank while the overwhelming majority of pages have a nearly negligible
PageRank. Consequently, when PageRanks are plotted in order of decreasing magnitude,
the resulting graph has a pronounced “L” shape with an extremely sharp bend. Figure 10.1
shows the PageRanks sorted in decreasing order of magnitude for a sample web graph
containing over 6,000 pages collected from the hollins.edu domain. It’s this characteristic
“L-shape” of PageRank distributions that reveals a near optimal partition $\mathcal{S} = L \cup \overline{L}$, as
described in the next section and shown experimentally in [111].

Figure 10.1 Power law distribution of PageRanks


10.8 CONCLUSIONS 


Reference [111] contains the results of numerous experiments that apply the iterative ag- 
gregation algorithm to update the PageRank for small subsets of the Web. The experiments 
lead to several conclusions. 


1. The iterative aggregation technique provides a significant improvement over the 
power method when a good L-set is used. In some cases, it requires less than 1/7 of 
the time required by the power method. 


2. The improvements become more pronounced as the size of the datasets increases. 


3. The iterative aggregation approach offers room for even greater improvements. For 
example, the extrapolation technique introduced in section 9.1 can be employed in 
conjunction with the iterative aggregation algorithm to further accelerate the updat- 
ing process. 


4. Good L-sets can be constructed by: 


• first putting all new states and states with altered links (perhaps along with
some nearest neighbors) into $L$,


• then adding other states that remain after the update in order of the magnitude
of their prior stationary probabilities up to the point where these stationary
probabilities level off (i.e., include states to the left of the bend in the PageRank
L-curve).

Of course, there is some subjectiveness in this strategy for choosing the L-set.
However, the leveling-off point is relatively easy to discern in distributions having a very
sharply defined bend in the L-curve, and only distributions that gradually die away
or do not conform to a power law distribution are problematic.


Finally (but very important), when iterative aggregation is used as an updating tech- 
nique, the fact that updates change the problem size is of little or no consequence. 
Thus, the algorithm is the first to handle both types of updates, link and page updates. 
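
The recipe in conclusion 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `choose_l_set`, the dictionary-based inputs, and in particular the numeric cutoff `level_off` (a stand-in for eyeballing the bend in the L-curve) are all hypothetical and are not taken from [111].

```python
def choose_l_set(prior_rank, touched, neighbors, level_off):
    """Build an L-set from the two-part heuristic.

    prior_rank: dict mapping surviving states to prior stationary probabilities
    touched:    new states and states with altered links
    neighbors:  dict mapping a state to its nearest neighbors
    level_off:  hypothetical cutoff standing in for the bend of the L-curve
    """
    touched = list(touched)
    l_set = set(touched)
    # Part 1: new or altered states, plus their nearest neighbors.
    for state in touched:
        l_set.update(neighbors.get(state, ()))
    # Part 2: add states in decreasing order of prior probability
    # until the probabilities drop below the leveling-off point.
    ranked = sorted(prior_rank.items(), key=lambda item: item[1], reverse=True)
    for state, probability in ranked:
        if probability < level_off:
            break
        l_set.add(state)
    return sorted(l_set)


# Made-up prior PageRanks with a sharp bend after pages "a" and "b";
# page "f" is new and links to page "d".
prior_rank = {"a": 0.45, "b": 0.35, "c": 0.08, "d": 0.07, "e": 0.05}
l_set = choose_l_set(prior_rank, touched=["f"], neighbors={"f": ["d"]},
                     level_off=0.10)
print(l_set)  # ['a', 'b', 'd', 'f']
```

Everything not returned here is lumped into the complement and handled as a single aggregated state.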


ASIDE: The Google Dance 


It’s believed that Google updates their PageRank vector on a monthly basis. The
process of updating is known as the Google Dance because pages dance up and down the
rankings during the three days of updating computation. There’s a nifty tool called the Google
Dance Tool (http://www.seochat.com/googledance/) for watching this dance.
The tool sends the query off to Google’s three primary servers. First, it goes to the main
Google server, www.google.com, then the two auxiliary servers www2.google.com and
www3.google.com, which are believed to be test servers. The tool reports the three sets of
top ten rankings side by side in a chart.


Most times of the month, the three lists show little or no variation. But when Google is
in the process of updating, the lists clearly vary substantially. It’s possible that during the
updating time of the month, the main server uses last month’s PageRank vector, while the test
servers show rankings that use iterates of the updated PageRank vector as it is being computed.
After a few days, when the dancing is done, all servers show the same lists again as they all 
use the completely updated PageRank vector. 


Many webmasters have come to fear the Google Dance. After working so hard to im- 
prove their rankings (by hook or by crook or by good content), just a slight tweak by Google 
in their PageRank or content score algorithms can ruin a webmaster’s traffic and business. In 
fact, the famous ethical SEO guru Danny Sullivan (see the aside on page 43) created the term 
Google Dance Syndrome (GDS) to describe the ailment that some webmasters suffer each 
month. In May 2003, there was a huge outbreak of GDS when Google made some substantial 
modifications to its algorithms, adding spam filters, quick fresh updates for popular pages, and 
more mirror sites. In September 2002, with their usual playful style, Google hosted an actual
Google Dance (see photos at http://www.google.com/googledance2002/), inviting
attendees of the nearby Search Engine Strategies Conference to the Googleplex to dance
the night away.


ASIDE: Googleopoly 


Reporters use the term Googleopoly to refer to Google’s dominance of web search. In
May 2004, Google claimed 36.8% of the market. The Yahoo conglomerate, which includes
Yahoo, AllTheWeb, AltaVista, and Overture, took second place with 26.6%, and MSN followed
with 14.5%. Google has steadily added more handy features like an online calculator,
dictionary, and spelling correction. Their recent rollout of Gmail, their email service that allows
for 1GB of storage and search within messages and message threads, has convinced many
that Google is poised to completely take over the market. BBC technology journalist Bill
Thompson has gone as far as to claim that government intervention is needed to break up
the Googleopoly. Thompson says that Google is a public utility that must be regulated in the
public interest. Googlites personally defend Google, citing the company’s history of making
morally good decisions and referring to the monopoly-busting cries as alarmist chatter.
Besides, they argue, Brin and Page made the company motto, “Don’t be evil,” which clearly
reveals their earnest intentions.


Librarians deal with the Googleopoly every day. They have to beg students to use search
services other than Google. Students are often surprised when a librarian finds a piece of 
information that they couldn’t find on Google. It’s as if the information doesn’t exist if it’s not 
on Google. Other diversified, specialized search tools have great value that a general purpose 
engine like Google can’t supply. As the librarians preach, learn to use several search engines, 
general and specialized, and watch your search skills multiply. Incidentally, number 8 on the 
top 10 list of signs that you’re addicted to Google is: shouting at the librarian if he takes 
longer than .1 seconds to find your information. A related bit of humor appears in the form 
of a cartoon that is floating around the Web. The cartoon pictures Bart Simpson learning his 
lesson by writing “I will use Google before asking dumb questions” over and over again on 
the chalkboard. 
Chapter Eleven
The HITS Method for Ranking Webpages 


If you’re a sports fan, you’ve seen those “____ is Life” t-shirts, where the blank is filled
in by a sport like football, soccer, cheerleading, fishing, etc. After reading the first ten
chapters of this book, you might be ready to declare “Google is Life.” But your mom
probably told you long ago that “there’s more to life than sports.” And there’s more to
search than Google. In fact, there’s Teoma, and Alexa, and A9, to name a few. The next
few chapters are devoted to search beyond Google. This chapter focuses specifically on
one algorithm, HITS, the algorithm that forms the basis of Teoma’s popularity ranking.


11.1 THE HITS ALGORITHM 


We first introduced HITS, the other system for ranking webpages by popularity, back in
Chapter 3. Since that was many pages ago, we review the major points regarding HITS.
HITS, which is an acronym for Hypertext Induced Topic Search, was invented by Jon 
Kleinberg in 1998—around the same time that Brin and Page were working on their 
PageRank algorithm. HITS, like PageRank, uses the Web’s hyperlink structure to create 
popularity scores associated with webpages. However, HITS has some important differ- 
ences. Whereas the PageRank method produces one popularity score for each page, HITS 
produces two. Whereas PageRank is query-independent, HITS is query-dependent. HITS 
thinks of webpages as authorities and hubs. An authority is a page with many inlinks, and 
a hub is a page with many outlinks. Authorities and hubs deserve the adjective good when 
the following circular statement holds: Good authorities are pointed to by good hubs and 
good hubs point to good authorities. And so every page is some measure of an authority 
and some measure of a hub. The authority and hub measures of HITS have been incorpo- 
rated into the CLEVER project at IBM Almaden Research Center [2]. HITS is also part of 
the ranking technology used by the new search engine Teoma [150]. 


After this recap, we are ready to translate these words about what HITS does into
mathematics. Every page $i$ has both an authority score $x_i$ and a hub score $y_i$. Let $E$ be
the set of all directed edges in the web graph and let $e_{ij}$ represent the directed edge from
node $i$ to node $j$. Given that each page has somehow been assigned an initial authority
score $x_i^{(0)}$ and hub score $y_i^{(0)}$, HITS successively refines these scores by computing

$$x_i^{(k)} = \sum_{j:\, e_{ji} \in E} y_j^{(k-1)} \quad\text{and}\quad y_i^{(k)} = \sum_{j:\, e_{ij} \in E} x_j^{(k)} \quad \text{for } k = 1, 2, 3, \ldots. \tag{11.1.1}$$


These equations, which were Kleinberg’s original equations, can be written in matrix 
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form with the help of the adjacency matrix L of the directed web graph. 


Le = 1, if there exists an edge from node 7 to node 7, 
‘I~ | 0, otherwise. 


For example, the adjacency matrix L for the small graph in Figure 11.1 is 


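$$L = \begin{array}{c|cccc}
 & P_1 & P_2 & P_3 & P_4 \\ \hline
P_1 & 0 & 1 & 1 & 0 \\
P_2 & 1 & 0 & 1 & 0 \\
P_3 & 0 & 1 & 0 & 1 \\
P_4 & 0 & 1 & 0 & 0
\end{array}$$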


Figure 11.1 Graph for 4-page web 


In matrix notation, the equations in (11.1.1) assume the form
$$x^{(k)} = L^T y^{(k-1)} \quad\text{and}\quad y^{(k)} = L x^{(k)},$$
where $x^{(k)}$ and $y^{(k)}$ are $n \times 1$ vectors holding the approximate authority and hub scores at
each iteration.


This leads to the following iterative algorithm for computing the ultimate authority 
scores x and hub scores y. 


THE ORIGINAL HITS ALGORITHM 


1. Initialize: $y^{(0)} = e$, where $e$ is a column vector of all ones. Other positive starting
   vectors may be used. (See section 11.3.)

2. Until convergence, do
      $x^{(k)} = L^T y^{(k-1)}$
      $y^{(k)} = L x^{(k)}$
      $k = k + 1$
      Normalize $x^{(k)}$ and $y^{(k)}$. (See section 11.3.)

Note that in step 2 of this algorithm, the two equations
$$x^{(k)} = L^T y^{(k-1)}, \qquad y^{(k)} = L x^{(k)}$$
can be simplified by substitution to
$$x^{(k)} = L^T L x^{(k-1)}, \qquad y^{(k)} = L L^T y^{(k-1)}.$$

These two new equations define the iterative power method for computing the dominant
eigenvector for the matrices $L^T L$ and $L L^T$. This is very similar to the PageRank power
method of Chapter 4, except a different coefficient matrix is used ($L^T L$ or $L L^T$) instead of
the Google matrix $G$. Since the matrix $L^T L$ determines the authority scores, it is called the
authority matrix, and $L L^T$ is known as the hub matrix. $L^T L$ and $L L^T$ are symmetric
positive semidefinite matrices. Computing the authority vector $x$ and the hub vector $y$ can
be viewed as finding dominant right-hand eigenvectors of $L^T L$ and $L L^T$, respectively.
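As a rough illustration of the iteration just described, the following Python sketch runs it on the 4-page graph of Figure 11.1. The helper names (`mat_vec`, `normalize`, `hits`) are our own, normalization by the entry sum is one possible choice, and a fixed iteration count stands in for a real convergence test.

```python
# Adjacency matrix of the 4-page graph in Figure 11.1:
# L[i][j] = 1 if page i+1 links to page j+1.
L = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]


def mat_vec(A, v):
    """Dense matrix-vector product A v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def normalize(v):
    """Scale a nonnegative vector so that its entries sum to 1."""
    total = sum(v)
    return [t / total for t in v]


def hits(L, iterations=100):
    """Original HITS iteration; the fixed count is an illustrative stopping rule."""
    LT = [list(col) for col in zip(*L)]   # transpose of L
    y = [1.0] * len(L)                    # step 1: y(0) = e
    for _ in range(iterations):
        x = normalize(mat_vec(LT, y))     # x(k) = L^T y(k-1), then normalize
        y = normalize(mat_vec(L, x))      # y(k) = L x(k), then normalize
    return x, y


authority, hub = hits(L)
```

Normalizing x before forming y only rescales y, so the limiting directions are the same as when both vectors are normalized at the end of each pass.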


11.2 HITS IMPLEMENTATION 


The implementation of HITS involves two main steps. First, a neighborhood graph N
related to the query terms is built. Second, the authority and hub scores (x and y) for 
each page in N are computed, and two ranked lists of the most authoritative pages and 
most “hubby” pages are presented to the user. Since the second step was described in the 
previous section, we focus on the first step. All pages containing references to the query 
terms are put into the neighborhood graph N. There are various ways to determine these 
pages. One simple method consults the inverted file index (see Chapter 2), which might 
look like: 


• term 1 (aardvark) - 3, 117, 3961


• term 10 (aztec) - 3, 15, 19, 101, 673, 1199
• term 11 (baby) - 3, 31, 56, 94, 673, 909, 11114, 253791


• term m (zymurgy) - 1159223


For each term, the pages mentioning that term are stored in list form. Thus, a query on 
terms 1 and 10 would pull pages 3, 15, 19, 101, 117, 673, 1199, and 3961 into N. Next, 
the graph around the subset of nodes in N is expanded by adding nodes that point either 
to or from nodes in N. This expansion allows some semantic associations to be made. 
That is, for the query term car, with the expansion about pages containing car, some 
pages containing automobile may now be added to N (presuming some pages about cars 
point to pages about automobiles and vice versa). This usually resolves the problem of 
synonyms. However, the set N can become very large due to the expansion process; a page
containing the query terms may possess a huge indegree or outdegree. Thus, in practice,
the maximum number of inlinking nodes and outlinking nodes to add for a particular node
in N is fixed, at say 100, in which case only the first 100 outlinking nodes of a page
containing a query term are added to N. (The process of building the neighborhood graph
is strongly related to building level sets in information filtering, which reduces a sparse
matrix to a much smaller, more query-relevant matrix [165].)
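The two-stage construction (root set from the inverted index, then a capped expansion along inlinks and outlinks) might be sketched as follows. The term lists come from the example above; the `outlinks` table, the function name, and the cap of 2 used in the call are hypothetical and chosen only to keep the illustration small.

```python
# Inverted index entries taken from the example in the text.
inverted_index = {
    "aardvark": [3, 117, 3961],
    "aztec": [3, 15, 19, 101, 673, 1199],
}

# Hypothetical link data (page -> pages it links to), invented for illustration.
outlinks = {
    3: [15, 2000],
    15: [3],
    117: [5000, 5001, 5002],
    4000: [117],
    4001: [3961],
}


def build_neighborhood(query_terms, inverted_index, outlinks, max_per_node=100):
    """Return (root set, expanded node set) for the neighborhood graph N."""
    # Step 1: pages that mention at least one query term.
    root = set()
    for term in query_terms:
        root.update(inverted_index.get(term, []))

    # Invert the link table so inlinking pages can be looked up.
    inlinks = {}
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks.setdefault(dst, []).append(src)

    # Step 2: add at most max_per_node outlinked and inlinking pages per root page.
    nodes = set(root)
    for page in sorted(root):
        nodes.update(outlinks.get(page, [])[:max_per_node])
        nodes.update(inlinks.get(page, [])[:max_per_node])
    return root, nodes


root, nodes = build_neighborhood(["aardvark", "aztec"], inverted_index, outlinks,
                                 max_per_node=2)
```

With the cap set to 2, page 117 contributes only its first two outlinks, so the hypothetical page 5002 stays out of N.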


Once the set N is built, the adjacency matrix $L$ corresponding to the nodes in N
is formed. The order of $L$ is much smaller than the total number of pages on the Web.
Therefore, computing authority and hub scores using the dominant eigenvectors of $L^T L$
and $L L^T$ incurs a small cost, small in comparison to computing authority and hub scores
when all documents on the Web are placed in N (as is done by the PageRank method).


An additional cost reduction exists. Only one dominant eigenvector needs to be
computed, that of either $L^T L$ or $L L^T$, but not both. For example, the authority vector $x$
can be obtained by computing the dominant eigenvector of $L^T L$, then the hub vector $y$
can be obtained from the equation $y = Lx$. A similar statement applies if the hub vector
is computed first from the eigenvector problem.


Notation for the HITS Problem 


N        neighborhood graph
L        sparse binary adjacency matrix for N
L^T L    sparse authority matrix
L L^T    sparse hub matrix
n        number of pages in N = order of L
x        authority vector
y        hub vector


Matlab m-file for the HITS algorithm 


This m-file is a Matlab implementation of the HITS power method given in sec- 
tion 11.1. 


function [x,y,time,numiter]=hits(L,x0,n,epsilon);

% HITS computes the HITS authority vector x and hub vector y
%      for an n-by-n adjacency matrix L with starting vector
%      x0 (a row vector). Uses power method on L'*L.
%
% EXAMPLE: [x,y,time,numiter]=hits(L,x0,100,1e-8);
%
% INPUT:  L = adjacency matrix (n-by-n sparse matrix)
%         x0 = starting vector (row vector)
%         n = size of L matrix (integer)
%         epsilon = convergence tolerance (scalar, e.g. 1e-8)
%
% OUTPUT: x = HITS authority vector
%         y = HITS hub vector
%         time = time until convergence
%         numiter = number of iterations until convergence
%
% The starting vector is usually set to the uniform vector,
% x0=1/n*ones(1,n).

k=0;
residual=1;
x=x0;
tic;

while (residual >= epsilon)
  prevx=x;
  k=k+1;
  x=x*L';                    % row-vector form of L*x
  x=x*L;                     % row-vector form of L'*(L*x)
  x=x/sum(x);                % normalize so the entries sum to 1
  residual=norm(x-prevx,1);
end
y=x*L';                      % hub vector from y = L*x
y=y/sum(y);
numiter=k;
time=toc;


11.3 HITS CONVERGENCE 


The iterative algorithm for computing HITS vectors is actually the power method (our
friend from Chapter 4) applied to $L^T L$ and $L L^T$. For a diagonalizable matrix $B_{n \times n}$
whose distinct eigenvalues are $\{\lambda_1, \lambda_2, \ldots, \lambda_k\}$ such that $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_k|$,
the power method takes an initial vector $x^{(0)}$ and iteratively computes

$$x^{(k)} = B x^{(k-1)}, \qquad x^{(k)} \longleftarrow \frac{x^{(k)}}{m(x^{(k)})},$$

where $m(x^{(k)})$ is a normalizing scalar derived from $x^{(k)}$. For example, it is common to
take $m(x^{(k)})$ to be the (signed) component of maximal magnitude (use the first if there are
more than one), in which case $m(x^{(k)})$ converges to the dominant eigenvalue $\lambda_1$, and $x^{(k)}$
converges to an associated normalized eigenvector [127]. If only a dominant eigenvector
is needed (and not the eigenvalue $\lambda_1$), then a normalization such as $m(x^{(k)}) = \|x^{(k)}\|$ can
be used. (If $\lambda_1 < 0$, then $m(x^{(k)}) = \|x^{(k)}\|$ can’t converge to $\lambda_1$, but $x^{(k)}$ still converges
to a normalized eigenvector associated with $\lambda_1$.) The asymptotic rate of convergence of the
power method is the rate at which $(|\lambda_2(B)|/|\lambda_1(B)|)^k \to 0$.
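A minimal sketch of this normalization, taking m to be the first signed component of maximal magnitude, is given below. The 2-by-2 test matrix, the starting vector, and the fixed iteration count are illustrative assumptions and are not taken from the text.

```python
def power_method(B, x0, iterations=100):
    """Power method normalized by the (signed) component of maximal magnitude."""
    x = list(x0)
    m = None
    for _ in range(iterations):
        x = [sum(b * t for b, t in zip(row, x)) for row in B]   # x = B x
        m = max(x, key=abs)        # first component of maximal magnitude, sign kept
        x = [t / m for t in x]     # after scaling, that component equals 1
    return m, x


# Symmetric test matrix with eigenvalues 3 and 1; dominant eigenvector (1, 1).
B = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = power_method(B, [1.0, 0.0])
```

Here the error shrinks by roughly a factor of 1/3 per step, which is the ratio of the two eigenvalue magnitudes for this matrix.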


The matrices $L^T L$ and $L L^T$ are symmetric, positive semidefinite, and nonnegative,
so their distinct eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_k\}$ are necessarily real and nonnegative with
$\lambda_1 > \lambda_2 > \cdots > \lambda_k \ge 0$. In other words, it is not possible to have multiple eigenvalues
on the spectral circle. Consequently, the HITS specialization of the power method avoids
most problematic convergence issues: HITS with normalization always converges. And
the rate of convergence is given by the rate at which $[\lambda_2(L^T L)/\lambda_1(L^T L)]^k \to 0$. Unlike
PageRank, we cannot give a good approximation to the asymptotic rate of convergence for
HITS. (Recall that the asymptotic rate of convergence for the PageRank problem is the rate
at which $\alpha^k \to 0$.) Many experiments show the eigengap $(\lambda_1 - \lambda_2)$ for HITS problems
to be large, and researchers suggest that only 10-15 iterations are required for convergence
[59, 60, 106, 134, 133]. However, despite the quick convergence, there can be a problem
with the uniqueness of the limiting authority and hub vectors. While $\lambda_1 > \lambda_2$, the structure
of $L$ might allow $\lambda_1$ to be a repeated root of the characteristic polynomial, in which case
the associated eigenspace is multidimensional. This means that different limiting authority
(and hub) vectors can be produced by different choices of the initial vector.


A simple example from [72] demonstrates this problem. In this example,

$$L = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix} \quad\text{and}\quad L^T L = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$

The authority matrix $L^T L$ (and also the hub matrix $L L^T$) has two distinct eigenvalues,
$\lambda_1 = 2$ and $\lambda_2 = 0$, which are each repeated twice. For the initial vector $x^{(0)} = 1/4\, e$,
the power method applied to $L^T L$ (with normalization by the 1-norm) converges to the
vector $x^{(\infty)} = (1/3 \;\; 1/3 \;\; 1/3 \;\; 0)^T$. Yet for $x^{(0)} = (1/4 \;\; 1/8 \;\; 1/8 \;\; 1/2)^T$, the
power method converges to $x^{(\infty)} = (1/2 \;\; 1/4 \;\; 1/4 \;\; 0)^T$. At the heart of this uniqueness
problem is the issue of reducibility.
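The dependence on the starting vector can be checked exactly with rational arithmetic. The sketch below applies the power method to the authority matrix of this example with 1-norm normalization; the function names and the step count are our own choices.

```python
from fractions import Fraction as F

# Adjacency matrix of the example from [72].
L = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
]


def authority_step(L, x):
    """One power-method step on L^T L, followed by 1-norm normalization."""
    n = len(L)
    t = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]   # t = L x
    z = [sum(L[i][j] * t[i] for i in range(n)) for j in range(n)]   # z = L^T t
    s = sum(z)
    return [v / s for v in z]


def run(x0, steps=20):
    x = x0
    for _ in range(steps):
        x = authority_step(L, x)
    return x


uniform_limit = run([F(1, 4)] * 4)
skewed_limit = run([F(1, 4), F(1, 8), F(1, 8), F(1, 2)])
```

Both starting vectors reach a fixed point after a single step, and the two fixed points differ, which is the nonuniqueness described above.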


A square matrix $B$ is said to be reducible if there exists a permutation matrix $Q$ such
that

$$Q^T B Q = \begin{pmatrix} X & Y \\ 0 & Z \end{pmatrix}, \quad\text{where } X \text{ and } Z \text{ are both square.}$$

Otherwise, the matrix is irreducible. The reducibility of a matrix means that there’s a set
of states that it’s possible to enter, but once entered, it’s impossible to exit. On the other
hand, a matrix is irreducible if every state is reachable from every other state. The
Perron-Frobenius theorem [127] ensures that an irreducible nonnegative matrix possesses a unique
normalized positive dominant eigenvector, called the Perron vector. Consequently, it’s the
reducibility of $L^T L$ that causes the HITS algorithm to converge to nonunique solutions.
PageRank actually encounters the same uniqueness problem, but the Google founders
suggested a way to cheat and alter the matrix, forcing irreducibility (actually primitivity as
well) and hence guaranteeing existence and uniqueness of the ranking vector (see
section 4.5). A modification similar to the Google primitivity trick can also be applied to
HITS. That is, a modified authority matrix $\xi L^T L + (1 - \xi)/n\, ee^T$ can be created, where
$0 < \xi < 1$ [134]. The modified hub matrix is similar. Miller et al. [72] and Ng et
al. [134] have developed similar modifications to HITS, called Exponentiated HITS and
Randomized HITS.
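To see the effect of the modification on the 4-node example from [72], the following sketch iterates with the modified authority matrix from the same two starting vectors used above. The value ξ = 0.95, the tolerance, and the function name are illustrative assumptions; because each iterate sums to 1, the rank-one term reduces to adding (1 − ξ)/n to every entry.

```python
# Adjacency matrix of the 4-node example from [72].
L = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
]


def modified_authority_limit(L, x0, xi=0.95, tol=1e-13, max_iter=100000):
    """Power method on xi*L^T L + (1 - xi)/n * e e^T with 1-norm normalization."""
    n = len(L)
    total = sum(x0)
    x = [v / total for v in x0]
    for _ in range(max_iter):
        t = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]        # L x
        z = [xi * sum(L[i][j] * t[i] for i in range(n)) + (1.0 - xi) / n     # xi L^T L x
             for j in range(n)]                                              # + (1 - xi)/n
        s = sum(z)
        new_x = [v / s for v in z]
        if sum(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x


limit_a = modified_authority_limit(L, [0.25, 0.25, 0.25, 0.25])
limit_b = modified_authority_limit(L, [0.25, 0.125, 0.125, 0.5])
```

Both runs should settle on the same positive vector. For this particular matrix the two largest eigenvalues of the modified matrix are close together, so the iteration needs on the order of a thousand steps.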


One final caveat regarding the power method concerns the starting vector $x^{(0)}$. In
general, regardless of whether the dominant eigenvalue $\lambda_1$ of the iteration matrix $B$ is
simple or repeated, convergence to a nonzero vector depends on the initial vector $x^{(0)}$
not being in the range of $(B - \lambda_1 I)$. If $x^{(0)}$ is randomly generated, almost certainly this
condition will hold, so in practice this is rarely an issue.


11.4 HITS EXAMPLE 


We present a very small example to demonstrate the implementation of the HITS algorithm.
First, a user presents query terms to the HITS system. There are several schemes that can be
used to determine which nodes “contain” query terms. For instance, one could take nodes
using at least one query term. Or to create a smaller sparse graph, one could take only nodes
using all query terms. For our example, suppose the subset of nodes containing the query
terms is {1, 6}. Next, we build the neighborhood graph about nodes 1 and 6. Suppose this
produces the following graph N, shown in Figure 11.2. From this neighborhood graph N,
the adjacency matrix $L$ is formed.


Figure 11.2 Neighborhood graph N for pages 1 and 6


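$$L = \begin{array}{c|cccccc}
 & 1 & 2 & 3 & 5 & 6 & 10 \\ \hline
1 & 0 & 0 & 1 & 0 & 1 & 0 \\
2 & 1 & 0 & 0 & 0 & 0 & 0 \\
3 & 0 & 0 & 0 & 0 & 1 & 0 \\
5 & 0 & 0 & 0 & 0 & 0 & 0 \\
6 & 0 & 0 & 1 & 1 & 0 & 0 \\
10 & 0 & 0 & 0 & 0 & 1 & 0
\end{array}$$

The respective authority and hub matrices are:

$$L^T L = \begin{array}{c|cccccc}
 & 1 & 2 & 3 & 5 & 6 & 10 \\ \hline
1 & 1 & 0 & 0 & 0 & 0 & 0 \\
2 & 0 & 0 & 0 & 0 & 0 & 0 \\
3 & 0 & 0 & 2 & 1 & 1 & 0 \\
5 & 0 & 0 & 1 & 1 & 0 & 0 \\
6 & 0 & 0 & 1 & 0 & 3 & 0 \\
10 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}
\qquad
L L^T = \begin{array}{c|cccccc}
 & 1 & 2 & 3 & 5 & 6 & 10 \\ \hline
1 & 2 & 0 & 1 & 0 & 1 & 1 \\
2 & 0 & 1 & 0 & 0 & 0 & 0 \\
3 & 1 & 0 & 1 & 0 & 0 & 1 \\
5 & 0 & 0 & 0 & 0 & 0 & 0 \\
6 & 1 & 0 & 0 & 0 & 2 & 0 \\
10 & 1 & 0 & 1 & 0 & 0 & 1
\end{array}$$


The normalized principal eigenvectors with the authority scores $x$ and hub scores $y$ are:


$$x^T = (0 \;\; 0 \;\; .3660 \;\; .1340 \;\; .5 \;\; 0) \quad\text{and}$$
$$y^T = (.3660 \;\; 0 \;\; .2113 \;\; 0 \;\; .2113 \;\; .2113).$$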


This example shows that there are two types of ties that can occur: ties at 0 and ties at
positive values. Ties at 0 can be avoided with the primitivity modification suggested at
the end of section 11.3. For the much larger matrices that occur in practice, the existence
of identical positive values in a dominant eigenvector is unlikely. Nevertheless, ties may
occur and can be broken by any tie-breaking strategy. Using a “first-come, first-served”
tie-breaking strategy, the authority and hub scores are sorted in decreasing order and the
page numbers are presented.


Authority ranking=(6 3 5 1 2 10), 
Hub ranking=(1 3 6 10 2 5). 


This means that page 6 is the most authoritative page for the query while page 1 is the best
hub for this query.
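The scores and rankings of this example can be reproduced with a few lines of Python. In the sketch below, the function names, the fixed iteration count, and the rounding to 8 digits before sorting (so that scores that agree up to floating-point noise count as ties) are our own choices; ties are then broken by position in the page list, which is one way to realize a first-come, first-served rule.

```python
pages = [1, 2, 3, 5, 6, 10]
L = [
    [0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]


def hits_scores(L, iterations=200):
    """Original HITS iteration with both vectors normalized to sum to 1."""
    n = len(L)
    y = [1.0] * n
    for _ in range(iterations):
        x = [sum(L[i][j] * y[i] for i in range(n)) for j in range(n)]   # x = L^T y
        sx = sum(x)
        x = [v / sx for v in x]
        y = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]   # y = L x
        sy = sum(y)
        y = [v / sy for v in y]
    return x, y


def ranking(scores, labels, digits=8):
    """Sort labels by decreasing score; ties keep their original order."""
    order = sorted(range(len(scores)), key=lambda i: (-round(scores[i], digits), i))
    return [labels[i] for i in order]


x, y = hits_scores(L)
authority_ranking = ranking(x, pages)
hub_ranking = ranking(y, pages)
```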


For comparison purposes, we now recompute the authority and hub vectors for the
modified HITS method, using the irreducible matrix $\xi L^T L + (1 - \xi)/n\, ee^T$ as the
authority matrix and $\xi L L^T + (1 - \xi)/n\, ee^T$ as the hub matrix. With this modification, the
matrices are irreducible, and by the Perron-Frobenius theorem, they each possess a unique,
normalized, positive dominant eigenvector (called the Perron vector). For the case when
$\xi = .95$,


$$x^T = (0.0032 \;\; 0.0023 \;\; 0.3634 \;\; 0.1351 \;\; 0.4936 \;\; 0.0023) \quad\text{and}$$
$$y^T = (0.3628 \;\; 0.0032 \;\; 0.2106 \;\; 0.0023 \;\; 0.2106 \;\; 0.2106).$$


Notice that, for this example, this irreducible modification does not change the authority
and hub rankings. Yet these modified scores are more appealing because they are unique
and positive (and thus, avoid ties at 0), and the power method is guaranteed to converge to
them, reaching any prescribed tolerance in a finite number of steps.


11.5 STRENGTHS AND WEAKNESSES OF HITS 


One advantage of the HITS algorithm is its dual rankings. HITS presents two ranked lists 
to the user: one with the most authoritative documents related to the query and the other 
with the most “hubby” documents. As a user, it’s nice to have this option. Sometimes you 
want authoritative pages because you are searching deeply on a research query. Other times 
you want hub (or portal) pages because you’re doing a broad search. Another advantage 
of HITS is the size of the problem. HITS casts the ranking problem as a small problem, 
finding the dominant eigenvectors of small matrices. The size of these matrices is very 
small relative to the total number of pages on the Web. 


However, there are some clear disadvantages to the HITS ranking system. Most 
troublesome is HITS’s query-dependence. At query time, a neighborhood graph must be 
built and at least one matrix eigenvector problem solved. And this must be done for each 
query. Of course, it’s easy to make HITS query-independent: simply drop the neighborhood
graph step and compute the authority and hub vectors, x and y, using the adjacency
matrix of the entire web graph. For more on the query-independent version of HITS, see 
section 11.7. 


HITS’s susceptibility to spamming creates a second strong disadvantage. By adding 
links to and from your webpage, you can slightly influence the authority and hub scores 
of your page. A slight change in these scores might be enough to move your webpage a 
few notches up the ranked lists returned to users. We’ve already mentioned how important 
it is to get into the first few pages of a search engine’s results since users generally view 
only the top 20 pages returned in a ranked list. Of course, adding outlinks from your 
page is much easier than adding inlinks. So influencing your hub score is not difficult.
Yet since hub scores and authority scores share a mutual dependence and are computed
interdependently, an authority score will increase as a hub score increases. Further, since 
the neighborhood graph is small in comparison to the entire Web, local changes to the link 
structure appear more drastic. Fortunately, Monika Henzinger and Krishna Bharat have 
proposed a modification to HITS that mitigates the problem of link spamming by using 
something called an L1 normalization step [26]. 


A final disadvantage of HITS is the problem of topic drift. In building the neigh- 
borhood graph N for a query it is possible that a very authoritative yet off-topic page be 
linked to a page containing the query terms. This very authoritative page can carry so much 
weight that it and its neighboring documents dominate the relevant ranked list returned to 
the user, skewing the results toward off-topic documents. Henzinger and Bharat suggest a 
solution to the problem of topic drift, weighting the authority and hub scores of the nodes 
in N by a measure of relevancy to the query [26]. In fact, to measure relevance of a node 
in N to the query, they use the same cosine similarity measure that is often used by vector 
space methods such as LSI [24, 64]. This solution to the topic drift problem is similar 
to the intelligent surfer modification to the basic PageRank model (see section 5.2). The 
binary elements in L (rather than H for the PageRank model) are given weights, which in 
effect improves the IQ of the HITS system. 


11.6 HITS’S RELATIONSHIP TO BIBLIOMETRICS 


The HITS algorithm has strong connections to bibliometrics research. Bibliometrics is the 
study of written documents and their citation structure. Such research uses the citation 
structure of a body of documents to produce numerical measures of the importance and 
impact of papers. Chris Ding and his colleagues at the Lawrence Berkeley National Labo- 
ratory have noted the underlying connection between HITS and two common bibliometrics 
concepts, co-citation and co-reference [59, 60]. 


In bibliometrics, co-citation occurs when two documents are both cited by the same 
third document. Co-reference occurs when two documents both refer to the same third 
document. On the Web, co-citation occurs when two nodes share a common inlinking 
node, while co-reference means two nodes share a common outlinking node. Ding et al. 
have shown that the authority matrix $L^T L$ of HITS has a direct relationship to the concept
of co-citation, while the hub matrix $L L^T$ is related to co-reference [59, 60]. Suppose the
small hyperlink graph of Figure 11.1 is studied again. The adjacency matrix is

$$L = \begin{array}{c|cccc}
 & P_1 & P_2 & P_3 & P_4 \\ \hline
P_1 & 0 & 1 & 1 & 0 \\
P_2 & 1 & 0 & 1 & 0 \\
P_3 & 0 & 1 & 0 & 1 \\
P_4 & 0 & 1 & 0 & 0
\end{array}$$

So the authority and hub matrices are:

$$L^T L = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 3 & 1 & 1 \\ 1 & 1 & 2 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix} = D_{in} + C_{cit}, \quad\text{and}$$

$$L L^T = \begin{pmatrix} 2 & 1 & 1 & 1 \\ 1 & 2 & 0 & 0 \\ 1 & 0 & 2 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix} = D_{out} + C_{ref}.$$


In general, Ding et al. [59, 60] show that $L^T L = D_{in} + C_{cit}$, where $D_{in}$ is a diagonal
matrix with the indegree of each node along the diagonal and $C_{cit}$ is the co-citation matrix.
For example, the (3, 3)-element of $L^T L$ means that node 3 has an indegree of 2. The
(1, 3)-element of $L^T L$ means that nodes 1 and 3 share only one common inlinking node, node
2, as is apparent from Figure 11.1. The (4, 3)-element of $L^T L$ implies that nodes 3 and 4
do not share a common inlinking node, again, as is apparent from Figure 11.1. Similarly,
the hub matrix is actually $D_{out} + C_{ref}$, where $D_{out}$ is the diagonal matrix of outdegrees
and $C_{ref}$ is the co-reference matrix. The (1, 2)-element of $L L^T$ means that nodes 1 and 2
share a common outlinking node, node 3. The (4, 2)-element implies that nodes 4 and 2 do
not share a common outlinking node. Ding et al. use these relationships between authority
and co-citation and hubs and co-reference to claim that simple inlink ranking provides a
decent approximation to the HITS authority score and simple outlink ranking provides a
decent approximation to hub ranking [59, 60, 61].
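The decomposition can be verified directly on the graph of Figure 11.1. The short sketch below forms both matrices with integer arithmetic and splits each into its diagonal (degree) part and off-diagonal (co-citation or co-reference) part; the variable names are our own.

```python
# Adjacency matrix of the 4-page graph in Figure 11.1.
L = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
n = len(L)

# Authority matrix L^T L and hub matrix L L^T.
authority_matrix = [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
hub_matrix = [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
              for i in range(n)]

# Degrees read off the adjacency matrix.
indegree = [sum(L[k][j] for k in range(n)) for j in range(n)]
outdegree = [sum(row) for row in L]

# Off-diagonal parts: co-citation counts and co-reference counts.
cocitation = [[authority_matrix[i][j] if i != j else 0 for j in range(n)]
              for i in range(n)]
coreference = [[hub_matrix[i][j] if i != j else 0 for j in range(n)]
               for i in range(n)]
```

Entry (i, j) of `cocitation` counts the pages that link to both page i and page j, while entry (i, j) of `coreference` counts the pages that both of them link to.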


11.7 QUERY-INDEPENDENT HITS 


HITS can be forced to be query-independent by computing a global authority and a global 
hub vector, which consequently slightly reduces the influence of link spamming. An effi- 
cient, foolproof way to do this is with the algorithm below, which is guaranteed to converge 
to the unique positive hub and authority vectors, regardless of the reducibility of the web 
graph (because the modified HITS matrices are used). We recommend automatically us- 
ing the modified HITS matrices because the web graph associated with an engine’s entire 
index will almost certainly be reducible, and therefore cause convergence and uniqueness 
problems for HITS. 


A QUERY-INDEPENDENT MODIFIED HITS ALGORITHM 


1. Initialize: $x^{(0)} = e/n$, where $e$ is a column vector of all ones. (Other positive
   normalized starting vectors may be used.)


2. Until convergence, do
      $x^{(k)} = \xi L^T L x^{(k-1)} + (1 - \xi)/n\, e$
      $x^{(k)} = x^{(k)} / \|x^{(k)}\|_1$
      $y^{(k)} = \xi L L^T y^{(k-1)} + (1 - \xi)/n\, e$
      $y^{(k)} = y^{(k)} / \|y^{(k)}\|_1$
      $k = k + 1$


3. Set the authority vector $x = x^{(k)}$ and the hub vector $y = y^{(k)}$.
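A small-scale sketch of this algorithm is shown below. For want of a web-sized matrix, it is run on the six-page adjacency matrix of section 11.4 with ξ = 0.95, where it should reproduce the modified scores reported there to about three decimal places. The function name, the stopping rule, and the choice of e/n as the starting hub vector are assumptions of the sketch.

```python
def modified_hits(L, xi=0.95, tol=1e-12, max_iter=1000):
    """Modified HITS: power method on xi*L^T L + (1-xi)/n ee^T and its hub analogue."""
    n = len(L)
    x = [1.0 / n] * n            # x(0) = e/n
    y = [1.0 / n] * n            # starting hub vector; e/n is assumed here
    shift = (1.0 - xi) / n       # (1 - xi)/n ee^T v equals shift * e when sum(v) = 1
    for _ in range(max_iter):
        # x(k) = xi L^T L x(k-1) + (1 - xi)/n e, then 1-norm normalization
        t = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        new_x = [xi * sum(L[i][j] * t[i] for i in range(n)) + shift for j in range(n)]
        sx = sum(new_x)
        new_x = [v / sx for v in new_x]
        # y(k) = xi L L^T y(k-1) + (1 - xi)/n e, then 1-norm normalization
        s = [sum(L[i][j] * y[i] for i in range(n)) for j in range(n)]
        new_y = [xi * sum(L[i][j] * s[j] for j in range(n)) + shift for i in range(n)]
        sy = sum(new_y)
        new_y = [v / sy for v in new_y]
        change = (sum(abs(a - b) for a, b in zip(new_x, x))
                  + sum(abs(a - b) for a, b in zip(new_y, y)))
        x, y = new_x, new_y
        if change < tol:
            break
    return x, y


# Six-page adjacency matrix from section 11.4 (pages 1, 2, 3, 5, 6, 10).
L = [
    [0, 0, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
x, y = modified_hits(L)
```

For a real web graph, L would be stored in sparse form and the two products with L and its transpose would be the dominant cost of each pass.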


When the query-independent HITS algorithm is used, $L$ is the adjacency matrix for
the search engine’s entire web graph, because the neighborhood graph N is no longer
formed. If Teoma used query-independent HITS, the order of $L$ would be about 1.5 billion.

It’s worthwhile to compare the query-independent HITS algorithm above with the
other query-independent ranking method, PageRank. The work in each step boils down
to the matrix-vector multiplications: $L^T L x^{(k-1)}$ for HITS, $L^T D^{-1} x^{(k-1)}$ for random
surfer PageRank, and $H^T x^{(k-1)}$ for intelligent surfer PageRank. The approximate work
required by one iteration of each method is given in Table 11.1. Here nnz(L) is the number
of nonzeros in $L$ and $n$ is the size of the matrix.


Table 11.1 Work per iteration required by the query-independent ranking methods 


Method                      | Multiplications | Additions
HITS                        | 0               | 2 nnz(L)
Modified HITS               | 0               | 4 nnz(L) + 2n
Random surfer PageRank      | n               | nnz(L) + n
Intelligent surfer PageRank | nnz(H)          | nnz(H) + n


For query-independent HITS, nnz(L) = nnz(H), but for query-dependent HITS
nnz(L) < nnz(H), where $H$ is PageRank’s raw hyperlink matrix. Query-independent
HITS requires about twice (and as much as four times) as much work per iteration as
PageRank. (There are other ways to implement modified HITS so that only one power
method is required, not two. For example, form the modified authority matrix $M = \tilde{L}^T \tilde{L}$,
where $\tilde{L} = L + \xi ee^T$. However, these methods do not come with the cute mathematical
properties of our proposed modification. See theorem 11.7.1.) Now to make the
comparison complete, let’s discuss the number of iterations required by the four methods.


There’s one very nice consequence of the modification to HITS that we’ve suggested 
in this chapter. Unlike other modifications [72, 134], ours allows us to say a great deal 
about the spectrum of our modified HITS matrix. Adapting a statement from [82] to our 
particular situation gives the following theorem for the modified authority matrix. Similar 
statements hold for the hub matrix. 


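Theorem 11.7.1. Let $M = \xi L^T L + (1 - \xi)/n\, ee^T$ be the modified authority matrix.
Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ be the eigenvalues of $L^T L$ and $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_n$ be the
eigenvalues of $M$. Then, the following interlacing property holds,

$$\gamma_1 \ge \xi\lambda_1 \ge \gamma_2 \ge \xi\lambda_2 \ge \cdots \ge \gamma_n \ge \xi\lambda_n.$$

And further, there exist scalars $\beta_i \ge 0$, $\sum_{i=1}^{n} \beta_i = 1$, such that $\gamma_i = \xi\lambda_i + (1 - \xi)\beta_i$.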


With this theorem we can now compare the asymptotic rate of convergence of the
four query-independent methods. See Table 11.2. The bounds for $\gamma_2/\gamma_1$ are derived by
examining extreme behavior. At one extreme, the modification to $L^T L$ increases
only $\xi\lambda_2$, by the maximal amount, to $\xi\lambda_2 + 1 - \xi$ (i.e., $\beta_2 = 1$, $\beta_i = 0$ for all $i \ne 2$),
which gives the upper bound on $\gamma_2/\gamma_1$. At the other extreme, only $\xi\lambda_1$ increases, to
$\xi\lambda_1 + 1 - \xi$ (i.e., $\beta_1 = 1$, $\beta_i = 0$ for all $i \ne 1$), which gives the lower bound. In practice,
many $\beta_i$’s change at once (but $\sum_i \beta_i = 1$), making the effect less
pronounced than the two extreme cases. Regardless of the exact values for the $\beta_i$’s, for
modified HITS, $\xi$ is usually chosen to be close to 1, so therefore, $\gamma_2/\gamma_1 \approx \lambda_2/\lambda_1$. Thus,
the asymptotic rates of convergence of HITS and modified HITS are nearly the same. Many
HITS experiments have shown $\lambda_2/\lambda_1 < .5$ [59, 60, 106, 133, 134], which is much less than
$\alpha = .85$ (the typical PageRank factor), so we can conclude that HITS and modified HITS
require many fewer iterations than PageRank.


Table 11.2 Asymptotic rate of convergence of the query-independent ranking methods


Method                      | Asymptotic rate (general)
HITS                        | $\lambda_2/\lambda_1$
Modified HITS               | $\dfrac{\xi\lambda_2}{\xi\lambda_1 + (1 - \xi)} \le \dfrac{\gamma_2}{\gamma_1} \le \dfrac{\xi\lambda_2 + (1 - \xi)}{\xi\lambda_1}$
Random surfer PageRank      | $\alpha$
Intelligent surfer PageRank | $\alpha$


So the query-independent HITS takes about twice as long per iteration as query- 
independent PageRank, but takes less than a quarter the number of iterations to reach the 
same tolerance level. Query-independent HITS (even with our theoretically pretty but 
practically slow version of modified HITS, which requires two power methods) is faster 
than the query-independent PageRank. And further, you get two HITS ranking vectors for 
the cost of one PageRank vector. 
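The iteration counts behind this comparison follow from solving rate^k ≤ tolerance for k. The sketch below does this for an asymptotic rate of .5 (the bound on λ2/λ1 quoted above) and for α = .85; the tolerance of 1e-8 and the function name are illustrative choices.

```python
import math


def iterations_needed(rate, tol):
    """Smallest integer k with rate**k <= tol, for 0 < rate < 1."""
    return math.ceil(math.log(tol) / math.log(rate))


hits_iters = iterations_needed(0.5, 1e-8)        # HITS with lambda_2/lambda_1 = .5
pagerank_iters = iterations_needed(0.85, 1e-8)   # PageRank with alpha = .85

# Relative total work if one HITS iteration costs about two PageRank iterations.
relative_work = 2 * hits_iters / pagerank_iters
```

Under these assumptions HITS needs 27 iterations against 114 for PageRank, so even at twice the cost per iteration the total work is a little under half.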


11.8 ACCELERATING HITS 


Kleinberg used the power method in his original HITS paper [106] to compute the authority
and hub vectors, which are the dominant right-hand eigenvectors of $L^T L$ and $L L^T$,
respectively. Computing dominant eigenvectors is an old problem, for which there are
several available numerical methods, especially for sparse symmetric systems [18, 57, 137, 
145, 162]. 


The problem of computing the original, query-dependent HITS vectors is different 
from that of the PageRank vector because the sizes of the matrices involved are so differ- 
ent. The HITS matrices are small, just the size of the neighborhood graph, whereas the 
PageRank matrix is huge, the size of the search engine’s entire index. PageRank methods 
are limited to memory-efficient methods that are matrix-free and don’t require the storage 
of extra intermediate information, which explains the prevalence of the power method in 
the PageRank literature. On the other hand, faster, more memory-intensive methods can be 
used on the much smaller HITS problem. We don’t know what method a HITS-based com- 
mercial engine like Teoma uses, but we expect it’d be a faster iterative method like Lanczos 
[57, 82], for instance, not the slow power method. The small size of the matrices involved 
in the HITS problem is also one reason why no research has been done on accelerating the 
computation of the HITS vectors—it’s already fast enough. On the other hand, because 
of the enormous size of the PageRank matrix, we spent an entire chapter (Chapter 9) on 
methods for accelerating the computation of PageRank. Of course, if query-independent 
HITS is done, then a large-scale implementation of HITS must handle the same issues that 
PageRank does. And acceleration techniques similar to those of PageRank, for instance 
the extrapolation techniques of Chapter 9, can be adapted to the HITS problem. 


11.9 HITS SENSITIVITY 


Suppose $L$, the adjacency matrix for the HITS neighborhood graph, changes, creating a
new matrix $\tilde{L}$. The question we pose in this section is: how sensitive are the authority and
hub vectors to these changes in the structure of the web graph? Regardless of the nature of
the changes, the new authority and hub matrices, $\tilde{L}^T \tilde{L}$ and $\tilde{L} \tilde{L}^T$, are still symmetric,
positive semidefinite matrices, which makes the perturbation analysis easier.


We adapt a theorem from Pete Stewart’s book [152, p. 51] for our specific situation. 


Theorem 11.9.1. Let $E$ be a perturbation matrix, so that $\tilde{L}^T \tilde{L} = L^T L + E$. When $\lambda_1$ is
simple,
$$\sin \angle(x, \tilde{x}) \le \frac{2\|E\|_2}{\delta}.$$
It is more appropriate to examine the angle between the old authority vector $x$ and
the new one $\tilde{x}$ ($\angle(x, \tilde{x})$) than the difference in length ($\|x - \tilde{x}\|$) for two reasons: (1) the
authority vectors are normalized in the HITS procedure, and (2) the ranking of elements
is important. Theorem 11.9.1 tells us that the separation between the two dominant
eigenvalues governs the sensitivity of the HITS vectors. If the eigengap $\delta = \lambda_1 - \lambda_2$ is large,
then the authority vector is insensitive to small changes in the web graph. On the other
hand, if the eigengap is small, the vector may be very sensitive. A similar theorem and
interpretation exist for the hub vector.


This theorem only applies when $\lambda_1$ is a simple root, which is guaranteed by a
modified HITS procedure (where an irreducible $\xi L^T L + (1 - \xi)/n\, ee^T$, or another modified
matrix [72, 134], replaces $L^T L$ as the authority matrix). If modified HITS is not done
and $\lambda_1$ is a repeated root, then we can examine the sensitivity of the eigenspace associated
with $\lambda_1$. A result from [153] gives the same conclusion: the sensitivity of the invariant
subspace associated with the repeated root $\lambda_1$ of the symmetric matrix depends primarily
on the eigengap.


Let’s consider an extreme (but not uncommon) example that makes it clear why the
HITS vectors can be sensitive when the eigengap is small. Suppose the neighborhood
graph contains two separate connected components, so $L$ is completely uncoupled. That
is, $L$ can be permuted to have the form

$$L = \begin{pmatrix} X & 0 \\ 0 & Z \end{pmatrix}.$$

First, we consider the case when the original unmodified HITS procedure is used. The
spectrum of the authority matrix is related to the spectra of the connected components;
$\sigma(L^T L) = \sigma(X^T X) \cup \sigma(Z^T Z)$. The component containing the largest eigenvalue is called
the largest connected component. The dominant eigenvector of $L^T L$ (thus, the authority
vector) has nonzero entries only in the positions corresponding to nodes in the largest
connected component because $L^T L$ has an eigendecomposition of the form

$$L^T L = \begin{pmatrix} U_1 & 0 \\ 0 & U_2 \end{pmatrix} \begin{pmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{pmatrix} \begin{pmatrix} U_1^T & 0 \\ 0 & U_2^T \end{pmatrix}.$$

That is, the authority vector has the form $x^T = (x_1^T \;\; 0^T)$. The addition of just one link
that connects the two components can make the authority vector positive, which can
significantly change the authority ranking.


A different perturbation, one that maintains the two separate connected components,
can more drastically change the authority vector and its ranking. Assume the largest and
second largest eigenvalues are in different components, and are not well separated.
Suppose enough links are added to the component with the second largest eigenvalue,
component 2, so that this component and its eigenvalue overtake the largest eigenvalue of the
other component, component 1. The title of largest component is transferred from
component 1 to component 2 and the authority vector now has nonzero entries only for nodes in
component 2, rather than component 1.
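A toy version of this swap can be simulated directly. In the sketch below the 9-node graph, its two components, and the two added links are all invented for illustration: before the perturbation the dominant eigenvalue of the authority matrix (3) belongs to component 1, and after two links are added inside component 2 its largest eigenvalue rises from 2 to about 5.24.

```python
def authority_vector(edges, n, iterations=200):
    """Unmodified HITS authority vector via the power method on L^T L."""
    x = [1.0 / n] * n
    for _ in range(iterations):
        t = [0.0] * n
        for i, j in edges:       # t = L x, where edge (i, j) means L[i][j] = 1
            t[i] += x[j]
        z = [0.0] * n
        for i, j in edges:       # z = L^T t
            z[j] += t[i]
        s = sum(z)
        x = [v / s for v in z]
    return x


# Hypothetical graph: component 1 on nodes 0-3, component 2 on nodes 4-8.
component_1 = [(0, 3), (1, 3), (2, 3)]
component_2 = [(4, 8), (5, 8), (6, 4), (7, 4)]

before = authority_vector(component_1 + component_2, 9)

# Perturbation: two new links inside component 2; the components stay separate.
after = authority_vector(component_1 + component_2 + [(6, 8), (7, 8)], 9)
```

Before the change essentially all of the authority weight sits on node 3; afterwards it sits on nodes 4 and 8, and node 3 drops to (numerically) zero.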


In this example we began with a completely uncoupled matrix. However, even when
the modified HITS procedure is used, so that the modified authority matrix is irreducible,
it may be nearly uncoupled. Although somewhat disguised by the irreducibility modification,
sensitivity to small perturbations can exist for the same reasons as in the completely un- 
coupled case. While the modification causes the zero entries in the authority vector to be 
positive, they are close to 0. See the effect of the modification on the numerical example 
of section 11.4. If a perturbation causes the two components to swap the title of largest 
component, then the large entries in the authority vector are swapped from one component 
to the other. (Ng et al. [134] propose a method called Subspace HITS that reduces the 
dominance of the largest connected component in the HITS rankings and tries to spread 
the scores across several connected components.) 


Since we know a good bit about the spectrum of M, the modified authority matrix,
we might try to say something more specific about how the modification affects the
sensitivity of the system. The eigengap of the unmodified method is given by
$\delta = \lambda_1 - \lambda_2$, whereas the modified method has an eigengap denoted by
$\rho = \gamma_1 - \gamma_2$. Using Theorem 11.7.1, we have

$$\xi\delta - (1-\xi) \le \rho \le \xi\delta + (1-\xi).$$

As $\xi \to 0$, the fudge factor matrix $(1/n)\,ee^T$ takes over, creating the uninteresting case
with an eigengap of 1 and stable uniform authority and hub vectors. The more interesting
case occurs when $\xi \to 1$. As expected, as $\xi \to 1$, the eigengap of the modified method
$\rho$ approaches the original eigengap $\delta$. We can conclude that the modified HITS system
is about as sensitive as the original HITS system. In summary, modified HITS does not
significantly affect the rate of convergence or the sensitivity of the system; its only effect is
on the existence and uniqueness of the HITS vectors. That is, modified HITS is guaranteed
to converge to unique positive HITS vectors.
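The behavior of the eigengap at the two extremes can be checked numerically. The sketch below assumes the modified authority matrix has the form $M = \xi L^T L + ((1-\xi)/n)\,ee^T$, and it uses a small hypothetical adjacency matrix rather than any example from the text; the helper names `eigengap` and `modified_gap` are ours.

```python
import numpy as np

def eigengap(sym_matrix):
    """Gap between the two largest eigenvalues of a symmetric matrix."""
    vals = np.linalg.eigvalsh(sym_matrix)  # eigenvalues in ascending order
    return vals[-1] - vals[-2]

# Hypothetical 4-page adjacency matrix (an assumption, not the book's example).
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = L.shape[0]
authority = L.T @ L              # unmodified authority matrix
delta = eigengap(authority)      # eigengap of the unmodified method

def modified_gap(xi):
    """Eigengap rho of xi * L^T L + (1 - xi)/n * e e^T (assumed form of M)."""
    M = xi * authority + (1 - xi) / n * np.ones((n, n))
    return eigengap(M)

for xi in (0.0, 0.5, 0.9, 0.99, 1.0):
    rho = modified_gap(xi)
    low = xi * delta - (1 - xi)
    high = xi * delta + (1 - xi)
    print(f"xi={xi:4.2f}  {low:8.4f} <= rho={rho:.4f} <= {high:.4f}")
```

With this setup the printed gap equals 1 at $\xi = 0$ and equals the unmodified eigengap at $\xi = 1$, and the interval around $\rho$ tightens as $\xi$ grows.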


ASIDE: Ranking by Eigenvectors 


PageRank and HITS both use the dominant eigenvector as a ranking tool. But this is not a new
idea; although much less publicized, it has been around for decades. In 1939, Maurice
Kendall and Babington Smith wrote one of the first ranking papers to use linear algebra
[105]. In order to create a ranking from voter preferences, Kendall and Smith built a
preference matrix A, where $a_{ij}$ is the number of voters who prefer player i to player j. Here
a player could be a candidate, team, participant, webpage, etc. The normalized row sum
$r_i$ of the preference matrix is a measure of the “winning percentage” of player i. That is,
$r = Ae/\|Ae\|$. (Kendall and Smith also created a coefficient of agreement among voter
preferences that can be used to locate voters who are inconsistent, and thus should have their
scores tossed. This coefficient of agreement can also be used to determine whether the data
warrant a global ranking; it may be that all voters appear inconsistent, which implies that
voters have been challenged with the impossible task of ranking indistinguishable objects.)


T. H. Wei extended the row sum ranking method to include powers of the preference
matrix A. In his 1952 Cambridge University thesis [161], Wei suggested the ranking
vector $r^{(k)} = A^k e/\|A^k e\|$, for some integer k. For k = 2, the ranking vector $r^{(2)}$ gives
some information about the strength of schedule. Using the sports ranking problem, $r_i^{(2)}$
is the winning percentage of teams defeated by team i. For k = 3, $r_i^{(3)}$ is the winning
percentage of teams defeated by teams defeated by team i. And so on. More recently, in
his 1993 paper [103], James P. Keener showed that many of the early ranking methods fall
under the Perron-Frobenius theorem. In fact, Wei’s powering idea can be extended so that
$r = \lim_{k \to \infty} A^k e/\|A^k e\|$, which can be arrived at by using the power method applied to
A with the starting vector e. The power method converges to the dominant eigenvector of A
provided A is nonnegative and irreducible.
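To make the row sum, powering, and limiting vectors concrete, here is a minimal sketch. The preference matrix is a hypothetical one for four players and five voters (so $a_{ij} + a_{ji} = 5$), the function names are ours, and we assume the 1-norm for the normalization, which the text leaves unspecified.

```python
import numpy as np

# Hypothetical preference matrix: A[i, j] = number of the 5 voters
# who prefer player i to player j (an assumption for illustration).
A = np.array([[0, 4, 3, 4],
              [1, 0, 3, 2],
              [2, 2, 0, 4],
              [1, 3, 1, 0]], dtype=float)

def wei_ranking(A, k):
    """Return r^(k) = A^k e / ||A^k e||_1 by repeated multiplication."""
    r = np.ones(A.shape[0])
    for _ in range(k):
        r = A @ r
        r = r / r.sum()   # entries are nonnegative, so the sum is the 1-norm
    return r

def power_method(A, tol=1e-12, max_iter=1000):
    """Iterate r <- A r / ||A r||_1 from the vector e until r stops changing."""
    r = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(max_iter):
        nxt = A @ r
        nxt = nxt / nxt.sum()
        if np.abs(nxt - r).sum() < tol:
            return nxt
        r = nxt
    return r

print("row sums, k = 1:", wei_ranking(A, 1))
print("k = 2          :", wei_ranking(A, 2))
print("limit          :", power_method(A))
```

For k = 1 this reproduces the normalized row sums of Kendall and Smith; larger k folds in the strength of the opponents each player beat.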


Much of the art of the ranking problem is in how A is defined. For the problem of 
ranking U.S. collegiate football teams, Keener provides the following possible definitions: 


• $a_{ij} = 1$ if team i beats team j, 0 otherwise,
• $a_{ij}$ = the proportion of times i beats j,
• $a_{ij}$ = the proportion of football ranking polls that have i outranking j,
• $a_{ij} = s_{ij}/(s_{ij} + s_{ji})$, where $s_{ij}$ is the number of points i scored in its encounter with j.


Keener also extends this to other more complicated scoring schemes, but the common connec- 
tion among all is the Perron-Frobenius theorem and the computation of a dominant eigenvec- 
tor. 


Using the dominant eigenvector for ranking problems has many applications besides 
webpage scoring and the ranking of sports teams. For example, other applications include 
tournament seeding (e.g., for tennis or golf) and handicapping assignments for betting pur- 
poses. The ranking problem has recently been explored for other networks, such as email 
networks among coworkers and networks of connection and communication among suspected 
terrorists. The dominant eigenvector also plays a prominent role in market share statistics and 
models of population dynamics. 
Chapter Twelve
Other Link Methods for Ranking Webpages 


The previous chapters dealt with the major ranking algorithms of PageRank and HITS in 
depth, but there are other minor players in the ranking game. This chapter provides a brief 
introduction to the ranking alternatives. 


12.1 SALSA 


In 1998, one could rank the popularity of webpages using either the PageRank or the 
HITS algorithm. In 2000, SALSA [114] sashayed into the game. SALSA, an acronym 
for Stochastic Approach to Link Structure Analysis, was developed by Ronny Lempel 
and Shlomo Moran and incorporated ideas from both HITS and PageRank to create yet 
another ranking of webpages. Like HITS, SALSA creates both hub and authority scores 
for webpages, and like PageRank, they are derived from Markov chains. In this section, 
we teach you the steps of SALSA with an example. 


12.1.1 SALSA Example 


In a manner similar to the original, query-dependent HITS, the neighborhood graph N 
associated with a particular query is formed. We use the same neighborhood graph N 
from the previous chapter, which is reproduced below in Figure 12.1. 


Figure 12.1 Neighborhood graph N for pages 1 and 6


SALSA differs from HITS in the next step. Rather than forming an adjacency matrix 
L for the neighborhood graph N, a bipartite undirected graph, denoted G, is built. G is 
defined by three sets: $V_h$, $V_a$, E, where $V_h$ is the set of hub nodes (all nodes in N with
outdegree > 0), $V_a$ is the set of authority nodes (all nodes in N with indegree > 0), and E
is the set of directed edges in N. Note that a node in N may be in both $V_h$ and $V_a$. For the
above neighborhood graph,

$V_h = \{1, 2, 3, 6, 10\}$,
$V_a = \{1, 3, 5, 6\}$.
The bipartite undirected graph G, shown in Figure 12.2, has a “hub side” and an “authority
side.” Nodes in $V_h$ are listed on the hub side and nodes in $V_a$ are on the authority side.
Every directed edge in E is represented by an undirected edge in G.


Figure 12.2 G: bipartite graph for SALSA


Next, two Markov chains are formed from G, a hub Markov chain with transition probability
matrix H, and an authority Markov chain with matrix A. Notice that in this chapter the
H matrix is SALSA’s hub matrix, not to be confused with PageRank’s raw hyperlink matrix
from several chapters prior. Reference [114] contains a formula for computing the elements
of H and A, but we feel a more instructive approach to building H and A clearly reveals
SALSA’s connection to both HITS and PageRank. Recall that HITS computes authority and
hub scores from the unweighted adjacency matrix L of N. On the other hand,
PageRank computes a measure analogous to an authority score using a row-normalized
weighted matrix G. SALSA uses both row and column weighting to compute its hub and
authority scores. Let $L_r$ be L with each nonzero row divided by its row sum and $L_c$ be L
with each nonzero column divided by its column sum. For our example,


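$$
L = \begin{array}{c|cccccc}
   & 1 & 2 & 3 & 5 & 6 & 10 \\ \hline
 1 & 0 & 0 & 1 & 0 & 1 & 0 \\
 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
 3 & 0 & 0 & 0 & 0 & 1 & 0 \\
 5 & 0 & 0 & 0 & 0 & 0 & 0 \\
 6 & 0 & 0 & 1 & 1 & 0 & 0 \\
10 & 0 & 0 & 0 & 0 & 1 & 0
\end{array},
$$

$$
L_r = \begin{array}{c|cccccc}
   & 1 & 2 & 3 & 5 & 6 & 10 \\ \hline
 1 & 0 & 0 & \tfrac12 & 0 & \tfrac12 & 0 \\
 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
 3 & 0 & 0 & 0 & 0 & 1 & 0 \\
 5 & 0 & 0 & 0 & 0 & 0 & 0 \\
 6 & 0 & 0 & \tfrac12 & \tfrac12 & 0 & 0 \\
10 & 0 & 0 & 0 & 0 & 1 & 0
\end{array},
\qquad
L_c = \begin{array}{c|cccccc}
   & 1 & 2 & 3 & 5 & 6 & 10 \\ \hline
 1 & 0 & 0 & \tfrac12 & 0 & \tfrac13 & 0 \\
 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
 3 & 0 & 0 & 0 & 0 & \tfrac13 & 0 \\
 5 & 0 & 0 & 0 & 0 & 0 & 0 \\
 6 & 0 & 0 & \tfrac12 & 1 & 0 & 0 \\
10 & 0 & 0 & 0 & 0 & \tfrac13 & 0
\end{array}.
$$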
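Then H, SALSA’s hub matrix, consists of the nonzero rows and columns of $L_r L_c^T$ and A
is the nonzero rows and columns of $L_c^T L_r$.

$$
L_r L_c^T = \begin{array}{c|cccccc}
   & 1 & 2 & 3 & 5 & 6 & 10 \\ \hline
 1 & \tfrac{5}{12} & 0 & \tfrac{2}{12} & 0 & \tfrac{3}{12} & \tfrac{2}{12} \\
 2 & 0 & 1 & 0 & 0 & 0 & 0 \\
 3 & \tfrac13 & 0 & \tfrac13 & 0 & 0 & \tfrac13 \\
 5 & 0 & 0 & 0 & 0 & 0 & 0 \\
 6 & \tfrac14 & 0 & 0 & 0 & \tfrac34 & 0 \\
10 & \tfrac13 & 0 & \tfrac13 & 0 & 0 & \tfrac13
\end{array},
\qquad
L_c^T L_r = \begin{array}{c|cccccc}
   & 1 & 2 & 3 & 5 & 6 & 10 \\ \hline
 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
 2 & 0 & 0 & 0 & 0 & 0 & 0 \\
 3 & 0 & 0 & \tfrac12 & \tfrac14 & \tfrac14 & 0 \\
 5 & 0 & 0 & \tfrac12 & \tfrac12 & 0 & 0 \\
 6 & 0 & 0 & \tfrac16 & 0 & \tfrac56 & 0 \\
10 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}.
$$

As a result, the SALSA hub and authority matrices are

$$
H = \begin{array}{c|ccccc}
   & 1 & 2 & 3 & 6 & 10 \\ \hline
 1 & \tfrac{5}{12} & 0 & \tfrac{2}{12} & \tfrac{3}{12} & \tfrac{2}{12} \\
 2 & 0 & 1 & 0 & 0 & 0 \\
 3 & \tfrac13 & 0 & \tfrac13 & 0 & \tfrac13 \\
 6 & \tfrac14 & 0 & 0 & \tfrac34 & 0 \\
10 & \tfrac13 & 0 & \tfrac13 & 0 & \tfrac13
\end{array},
\qquad
A = \begin{array}{c|cccc}
   & 1 & 3 & 5 & 6 \\ \hline
 1 & 1 & 0 & 0 & 0 \\
 3 & 0 & \tfrac12 & \tfrac14 & \tfrac14 \\
 5 & 0 & \tfrac12 & \tfrac12 & 0 \\
 6 & 0 & \tfrac16 & 0 & \tfrac56
\end{array}.
$$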
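The construction of H and A from the row- and column-normalized adjacency matrices can be sketched in a few lines. The link structure below is our reading of the example: the rows for pages 1, 2, 3, 5, and 6 follow the adjacency matrix, while the single outlink of page 10 (to page 6) is inferred from the sets of hub and authority nodes and from the stationary vectors quoted later, so treat it as an assumption. The helper name `normalize_rows` is ours.

```python
import numpy as np

nodes = [1, 2, 3, 5, 6, 10]
# Adjacency matrix of the neighborhood graph N (row for page 10 is inferred).
L = np.array([[0, 0, 1, 0, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1, 0]], dtype=float)

def normalize_rows(M):
    """Divide each nonzero row by its row sum; zero rows are left as zeros."""
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M), where=sums != 0)

L_r = normalize_rows(L)        # row-normalized L
L_c = normalize_rows(L.T).T    # column-normalized L

hub_idx = np.flatnonzero(L.sum(axis=1) > 0)    # pages with outdegree > 0
auth_idx = np.flatnonzero(L.sum(axis=0) > 0)   # pages with indegree > 0

# Keep only the nonzero rows and columns of the two products.
H = (L_r @ L_c.T)[np.ix_(hub_idx, hub_idx)]
A = (L_c.T @ L_r)[np.ix_(auth_idx, auth_idx)]

print("hub nodes      :", [nodes[i] for i in hub_idx])
print("authority nodes:", [nodes[i] for i in auth_idx])
print(np.round(H, 4))
print(np.round(A, 4))
```

Both H and A come out row stochastic, which is what allows them to be treated as transition matrices of Markov chains.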
If the bipartite graph G is connected, then H and A are both irreducible Markov
chains and $\pi_h^T$, the stationary vector of H, gives the hub scores for the query with
neighborhood graph N, and $\pi_a^T$ gives the authority scores. If G is not connected, then H and A
contain multiple irreducible components. In this case, the global hub and authority scores
must be pasted together from the stationary vectors for each individual irreducible
component. (Reference [114] contains the justification for the two if-then statements above.)


Since an undirected graph G is connected if every node is reachable from every other 
node, our graph G from Figure 12.2 is not connected because, for instance, node 2 is not 
reachable from every other node. For bigger graphs, where connectedness cannot be deter- 
mined by inspection, graph traversal algorithms exist that identify both the connectedness 
and the connected components of the graph [54]. Because G is not connected, H and A 
contain multiple connected components. H contains two connected components, C = {2}
and D = {1, 3, 6, 10}, while A’s connected components are E = {1} and F = {3, 5, 6}.
Also clear from the structure of H and A is the periodicity of the Markov chains. All
irreducible components of H and A contain self-loops, implying that the chains are aperiodic.
The stationary vectors for the two irreducible components of H are

$$
\pi_h^T(C) = \begin{array}{c} 2 \\ (\,1\,) \end{array},
\qquad
\pi_h^T(D) = \begin{array}{cccc} 1 & 3 & 6 & 10 \\ (\,\tfrac13 & \tfrac16 & \tfrac13 & \tfrac16\,) \end{array},
$$

while the stationary vectors for the two irreducible components of A are

$$
\pi_a^T(E) = \begin{array}{c} 1 \\ (\,1\,) \end{array},
\qquad
\pi_a^T(F) = \begin{array}{ccc} 3 & 5 & 6 \\ (\,\tfrac13 & \tfrac16 & \tfrac12\,) \end{array}.
$$
Proposition 6 of the original SALSA paper [114] contains the method for pasting the hub 
and authority scores for the individual components into global popularity vectors. The 
suggestion there is simple and intuitive. Since the hub component C only contains 1 of the
5 total hub nodes, its stationary hub vector should be weighted by 1/5, while D, containing
4 of the 5 hub nodes, has its stationary vector weighted by 4/5. Thus the global hub vector
is

$$
\pi_h^T = \begin{array}{ccccc}
1 & 2 & 3 & 6 & 10 \\
(\,\tfrac45 \cdot \tfrac13 & \tfrac15 \cdot 1 & \tfrac45 \cdot \tfrac16 & \tfrac45 \cdot \tfrac13 & \tfrac45 \cdot \tfrac16\,)
\end{array}
= \begin{array}{ccccc}
1 & 2 & 3 & 6 & 10 \\
(\,.2667 & .2 & .1333 & .2667 & .1333\,)
\end{array}.
$$


With similar weighting for authority nodes, the global authority vector can be constructed 
from the individual authority vectors as 


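$$
\pi_a^T = \begin{array}{cccc}
1 & 3 & 5 & 6 \\
(\,\tfrac14 \cdot 1 & \tfrac34 \cdot \tfrac13 & \tfrac34 \cdot \tfrac16 & \tfrac34 \cdot \tfrac12\,)
\end{array}
= \begin{array}{cccc}
1 & 3 & 5 & 6 \\
(\,.25 & .25 & .125 & .375\,)
\end{array}.
$$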
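The pasting step can be sketched as follows. This is a minimal illustration of the weighting just described (each component's stationary vector scaled by the component's share of the nodes), not a transcription of Proposition 6 of [114]. The matrices H and A are entered with the values implied by the definitions of $L_r L_c^T$ and $L_c^T L_r$ for this example, and the function names are ours.

```python
import numpy as np

# SALSA hub matrix (pages 1, 2, 3, 6, 10) and authority matrix (pages 1, 3, 5, 6).
H = np.array([[5/12, 0, 1/6, 1/4, 1/6],
              [0,    1, 0,   0,   0  ],
              [1/3,  0, 1/3, 0,   1/3],
              [1/4,  0, 0,   3/4, 0  ],
              [1/3,  0, 1/3, 0,   1/3]])
A = np.array([[1, 0,   0,   0  ],
              [0, 1/2, 1/4, 1/4],
              [0, 1/2, 1/2, 0  ],
              [0, 1/6, 0,   5/6]])

def components(P):
    """Connected components of the graph whose edges are the nonzeros of P."""
    seen, comps = set(), []
    for start in range(P.shape[0]):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:                      # depth-first traversal
            i = stack.pop()
            comp.append(i)
            for j in np.flatnonzero((P[i] > 0) | (P[:, i] > 0)):
                if int(j) not in seen:
                    seen.add(int(j))
                    stack.append(int(j))
        comps.append(sorted(comp))
    return comps

def stationary(P, max_iter=10_000, tol=1e-14):
    """Stationary vector of an irreducible, aperiodic stochastic matrix."""
    pi = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return nxt / nxt.sum()

def salsa_scores(P):
    """Weight each component's stationary vector by its share of the nodes."""
    n = P.shape[0]
    scores = np.zeros(n)
    for comp in components(P):
        block = P[np.ix_(comp, comp)]
        scores[comp] = len(comp) / n * stationary(block)
    return scores

print("global hub vector      :", np.round(salsa_scores(H), 4))
print("global authority vector:", np.round(salsa_scores(A), 4))
```

The traversal in `components` plays the role of the graph traversal algorithms mentioned above; for large graphs a dedicated connected-components routine would be the natural replacement.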


Compare the SALSA hub and authority vectors with those of HITS in section 11.4. They 
are quite different. They’re not even the same length and they give significantly different 
rankings for this example. Ranking the pages from most important to least important gives 


SALSA hub ranking = (1/6 2 3/10)
HITS hub ranking = (1 3/6/10 2 5)
SALSA authority ranking = (6 1/3 5)
HITS authority ranking = (6 3 5 1 2/10)


where the / symbol indicates a tie. 


Our little example is instructive for two additional reasons. First, it shows one way 
to paste the individual component scores together to create global scores. There are, of 
course, other weighting schemes for the pasting process. Second, the presence of mul- 
tiple connected components (which occurs when G is not connected, and is common- 
place in practice) is computationally welcome because the Markov chains to be solved 
are much smaller. Contrast this with PageRank’s artificial correction for a disconnected 
web graph, whereby connectedness is forced by adding direct links between all webpages. 
PageRank researchers Konstantin Avrachenkov and Nelly Litvak have suggested that, sim- 
ilar to SALSA, PageRank be computed on smaller connected components, then pasted 
together to get the global PageRank vector [11]. Of course, in order to implement their 
suggestion, the multiple connected components of the entire web graph must be found 
first. But that’s not so hard—there’s Tarjan’s O(V + E) linear time algorithm [54], where 
V and E are the number of vertices and edges in the graph. Unfortunately, it appears that 
the connected component decomposition for the PageRank problem can have only limited 
potential because researchers have discovered a bow-tie structure to the Web [41], which 
shows that the largest connected component of the Web is over a quarter of the size of the 
entire Web, meaning, at best, the decomposition can reduce the size of the problem by a 
factor of 4. 
12.1.2 Strengths and Weaknesses of SALSA


Because SALSA combines some of the best features of HITS and PageRank, it has many 
strengths. For example, unlike HITS, SALSA is not victimized by the topic drift prob- 
lem [26, 114], whereby off-topic but important pages sneak into the neighborhood set and 
dominate the scores. Recall that another problem with HITS was its susceptibility to spam- 
ming due to the interdependence of hub and authority scores. SALSA is less susceptible to 
spamming since the coupling between hub and authority scores is much less strict. How- 
ever, both HITS and SALSA are a little easier to spam than PageRank. SALSA, like HITS, 
also has the benefit of dual rankings, something that PageRank does not supply. Lastly, 
the presence of multiple connected components in SALSA’s bipartite graph G, a common 
occurrence in practice, is a computational blessing. 


However, one serious drawback to the widespread use of the SALSA algorithm is 
its query-dependence. At query time, the neighborhood graph N for the query must be 
formed and the stationary vectors for two Markov chains must be computed. Another 
problematic issue for SALSA is convergence. The convergence of SALSA is similar to 
that of HITS. Because both HITS and SALSA in their original unmodified versions do 
not force irreducibility onto the graph, the resulting vectors produced by their algorithms 
may not be unique (and may depend on the starting vector) if the neighborhood graph is 
reducible [72]. Nevertheless, a simple solution is to adopt the PageRank fix and force 
irreducibility by altering the graph in some small way. 


12.2 HYBRID RANKING METHODS 


Due to the effectiveness of ranking algorithms in aiding web information retrieval, re- 
searchers have proposed many new algorithms for ranking webpages. Most are modifica- 
tions to and combinations of the original three methods of PageRank, HITS, and SALSA 
[26, 36, 53, 60, 71, 88, 120, 134, 142]. In the next section, we discuss one of the most
original new ranking algorithms, TrafficRank. 


Some recent work attempts to merge the results from several independent ranking 
algorithms. This seems promising because experiments show that often the top-k lists 
of the ranking scores created by different algorithms are very different. This surprising 
lack of overlap is exciting—it suggests that in the future, perhaps medleys of information 
retrieval algorithms (realized through meta-search engines) will provide the most relevant 
and precise documents for user queries [120]. Cynthia Dwork, now of Microsoft, is one of 
the leaders of the field of rank aggregation, the field that studies how to best combine the 
top-k lists from several search engines into one unified ranked list. 


ASIDE: Rank Aggregation and Voting Methods 


The rank aggregation done by meta-search engines is very similar to the aggregation of 
voter preferences. In the case of web search, the political candidates are replaced by webpages 
and each ranked list of pages produced by a particular algorithm takes the place of the ranked 
list of candidates that a voter submits for an election. Given a stack of rank ordered lists, the 
goal in an election is usually to find one overall winner (and possibly a few runners-up). For
meta-search, the goal is to find not just the overall winner but the entire combined ranking, i.e.,
one ranking of all the candidates. Because of its influence on government, voting research, 
also known as social choice theory, has a long history. In 1785, Marie Jean Antoine Nico- 
las Caritat, the Marquis de Condorcet, a French philosopher, mathematician, economist, and 
social scientist, wrote an Essay on the Application of Analysis to the Probability of Majority 
Decisions, in which he revealed the Condorcet voting paradox. In a voting system that studies 
pairwise comparisons, the Condorcet winner, if it exists, is the candidate that beats or ties all 
others in the pairwise comparisons of candidates. Consider an example. Three voters rank 
their preferences for three candidates A, B, and C as follows: voter 1 ranks the candidates A B 
C, voter 2, B C A, and voter 3, C A B. The majority of the voters have A beating B, B beating 
C, and C beating A, which creates a cycle, and thus, a Condorcet paradox because the majority
rule is in conflict with itself. Many methods have been proposed for resolving the problem of
cycles in order to declare a Condorcet winner.
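The three-voter example can be checked with a short script. The sketch below tallies the pairwise comparisons and looks for a candidate who beats or ties all others; the function names are ours, and when several candidates tie the first one in alphabetical order is returned.

```python
from itertools import combinations

# The three ballots from the example, each listed from most to least preferred.
ballots = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def pairwise_wins(ballots):
    """Return the candidates and a map (x, y) -> number of voters ranking x above y."""
    candidates = sorted(ballots[0])
    wins = {}
    for x, y in combinations(candidates, 2):
        x_over_y = sum(b.index(x) < b.index(y) for b in ballots)
        wins[(x, y)] = x_over_y
        wins[(y, x)] = len(ballots) - x_over_y
    return candidates, wins

def condorcet_winner(ballots):
    """Return a candidate who beats or ties every other candidate, or None."""
    candidates, wins = pairwise_wins(ballots)
    for c in candidates:
        if all(wins[(c, o)] >= wins[(o, c)] for o in candidates if o != c):
            return c
    return None

_, wins = pairwise_wins(ballots)
print(wins[("A", "B")], wins[("B", "C")], wins[("C", "A")])  # 2 2 2
print(condorcet_winner(ballots))                              # None
```

Each of A over B, B over C, and C over A wins by two votes to one, so no candidate survives every pairwise comparison and the function returns `None`.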


Related to the voting paradox is Kenneth Arrow’s Impossibility Theorem. Arrow, an 
American economist, won the 1972 Nobel Prize in Economics for his work on social choice 
theory. His 1951 doctoral thesis, Social Choice and Individual Values, described five prop- 
erties that every fair voting system should have. He then proved that no voting system could 
satisfy all five properties. Scholars debate Arrow’s conditions, arguing over which are truly 
necessary, which are less important, etc. Nevertheless, his theorem shows that in many situa- 
tions there is no fair, logical way of aggregating individual preferences to accurately determine 
the collective preferences of the voters. Many voting systems now exist for a variety of voting 
situations, and voting systems are judged by various criteria, such as resistance to manipu- 
lation, Condorcet efficiency (the percentage of elections in which the Condorcet winner is 
selected), neutrality, and consistency. 


Understanding the problems with determining a fair voting system that declares one 
overall winner gives an appreciation for how much harder it is to determine a complete ranking 
of all candidates, and thus, how much harder the rank aggregation problem is for web search. 


12.3 RANKINGS BASED ON TRAFFIC FLOW 


The Internet is often called the Information SuperHighway. That image helps describe 
our final ranking method, TrafficRank. Rather than thinking about a lone surfer bouncing 
around the Web (as Google does), imagine millions, or billions, as actually happens in real 
life. Now the Web’s links become highways between pages, which means there’s conges- 
tion and traffic. While these things are unpleasant on the auto highway, they’re useful for 
ranking webpages on the information superhighway. In the auto analogy, if we knew the 
total number of cars on the highways leading into the North Carolina Outer Banks, we’d 
have a measure of how popular the Outer Banks are as a destination. (If you’ve ever waited 
in the backup on Route 12 heading into Nags Head on a Saturday during the summer, you’d 
agree this gives a pretty good approximation to destination popularity.) Actually, the total 
number of cars entering the Outer Banks divided by the total number on all highways gives 
a relative measure of the Outer Banks’ popularity compared to that of other destinations. 
Unfortunately, counting the number of surfers on links on the Web is impossible. (A re- 
lated effort counts the number of surfers on pages, a much more manageable number. See 
the Alexa aside on page 138.) But all is not lost; there is a way to approximate the number 
of surfers on each link using the available graph information. 


Let $p_{ij}$ be the proportion of all web traffic on the link from page i to page j. Then
$p_{ij} = 0$ if there is no hyperlink from i to j. This definition for $p_{ij}$ means there’s a variable
for every hyperlink on the Web. The goal is to estimate these $p_{ij}$’s, then set $\sum_i p_{ij}$, which
is the proportion of traffic entering page j, as the TrafficRank of page j. The variables $p_{ij}$
must satisfy some constraints. First, of course, $\sum_{i,j} p_{ij} = 1$. Second, assuming traffic
flow into a page equals traffic flow out of a page, $\sum_i p_{ij} - \sum_i p_{ji} = 0$ for every page j.
Otherwise, the $p_{ij}$’s are free to take on any values. IBM researcher John Tomlin devised
the following optimization problem to find the $p_{ij}$’s for his TrafficRank model [159].


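$$
\begin{aligned}
\max\ & -\sum_{i,j} p_{ij} \log p_{ij} \quad \text{subject to} \\
& \sum_{i,j} p_{ij} = 1, \\
& \sum_i p_{ij} - \sum_i p_{ji} = 0, \quad \text{for every } j, \\
& p_{ij} \ge 0.
\end{aligned}
$$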


The objective function is the famous entropy function from Claude Shannon’s work on
information theory [149]. The entropy function maximizes the freedom in choosing the $p_{ij}$’s.
The theory says that the entropy function gives the best unbiased probability assignment to 
the variables given the constraints. It uses only the given information from the constraints 
and is maximally noncommittal with respect to the missing information. 


OK, so just solve the optimization problem to get the $p_{ij}$’s and form the TrafficRank
for each page. Problem solved. But wait, you protest, that optimization problem is huge; it
has one variable for every edge in the web graph. True, but Tomlin provides
a fast iterative algorithm for computing the variables. The algorithm uses Lagrange
multipliers and impressively exploits the problem’s structure so that solving the optimiza- 
tion problem only takes about 2.5 times longer than solving a PageRank problem for the 
same graph. Tomlin’s results showed that TrafficRank was similar to HITS hub scores in 
the sense that high TrafficRank pages tended to have many outlinks. This similarity to hubs 
makes sense because TrafficRank measures flow through a page, and heavy flow requires 
a large number of both inlinks and outlinks. 
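For a toy graph, the maximum-entropy flows can be computed directly. The sketch below is not Tomlin's published algorithm. It relies on the stationarity conditions of the Lagrangian for the two equality constraints above, which give $p_{ij} \propto a_i/a_j$ on every link, where $a_i$ is the exponential of the multiplier for page i, and then repeatedly rescales one $a_k$ at a time so that flow into page k equals flow out of it. The four-page graph, the function name `traffic_rank`, and the stopping rule are all assumptions for illustration, and the graph is assumed strongly connected with no self-links.

```python
import numpy as np

# Hypothetical 4-page web: adj[i, j] = 1 if page i links to page j.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 1],
                [1, 0, 0, 1],
                [1, 0, 0, 0]], dtype=float)

def traffic_rank(adj, sweeps=500, tol=1e-12):
    """Maximum-entropy link flows P and the TrafficRank of each page."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    a = np.ones(n)   # a[i] = exp(multiplier of the flow constraint at page i)
    for _ in range(sweeps):
        biggest_change = 0.0
        for k in range(n):
            inflow = a @ adj[:, k]          # sum of a[i] over links i -> k
            outflow = adj[k, :] @ (1 / a)   # sum of 1/a[j] over links k -> j
            new = np.sqrt(inflow / outflow)  # balances flow in and out of k
            biggest_change = max(biggest_change, abs(new - a[k]))
            a[k] = new
        if biggest_change < tol:
            break
    P = adj * np.outer(a, 1 / a)   # p_ij proportional to a_i / a_j on links
    P = P / P.sum()                # enforce that all flows sum to 1
    return P, P.sum(axis=0)        # TrafficRank of j = total flow into j

P, rank = traffic_rank(adj)
print(np.round(P, 4))
print("TrafficRank:", np.round(rank, 4))
```

Because the flow constraint forces inflow to equal outflow at every page, the column sums and row sums of the computed P agree, which is a quick sanity check on any solver for this problem.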


The TrafficRank model has two interesting extensions. First, as more traffic infor- 
mation becomes available, it can easily be added to the model in the form of constraints. 
For instance, if actual data is collected on traffic at popular sites, then constraints of the 
form 


$$\ell_j \le \sum_i p_{ij} \le u_j, \quad \text{for } j \in J,$$


give an allowable range on the computed TrafficRank values of pages in the set J of popular
sites. Second, the dual solution of the optimization problem has an interesting
interpretation.¹ Inverting the Lagrange multipliers (there’s one for each constraint) of the
primal solution gives a “temperature” for each webpage. (This interpretation comes from
the thermodynamics relationship between entropy and heat.) As a result, Tomlin used the
dual measure to form a HotRank for each page. This HotRank was similar to, but generally
outperformed, PageRank as a measure of authoritativeness.


¹ Many optimization problems have both primal and dual formulations whose solutions are related by the
famous Duality Theorem.
Finally, we mention TrafficRank’s connections to our two well-studied ranking
algorithms, PageRank and HITS. The matrix $P_{n \times n} = [p_{ij}]$ (where n is the number of pages
in the index) formed from the solutions to the TrafficRank optimization problem is sparse,
nonnegative, and substochastic. Of course, the Perron vector (the dominant eigenvector)
could be computed for this matrix and compared with the query-independent HITS vector.
Similarly, the TrafficRank matrix P could be row-normalized so that it is stochastic.
Then the dominant left-hand eigenvector is computed, which, in this case, is actually the
PageRank vector for an intelligent surfer model.


ASIDE: Alexa Traffic Ranking 


Alexa, an amazon.com search company, uses its Toolbar to gather information about 
web usage, which in turn produces popularity rankings based on site traffic. As the Alexa 
website says, “the more people [that] use Alexa [specifically its Toolbar], the more useful it 
will be.” Alexa makes other uses of its collected data. For example, there’s a list of Movers
and Shakers, the top ten websites with the most dramatic increase or decrease in their traffic 
ranking during the past week. There’s also a list of the 500 most popular sites according to 
Alexa. And, there’s the traffic ranking plot of popular webpages over time. Figure 12.3 shows 
the traffic rank trends for Yahoo!, Google, AltaVista, and Teoma from February to August 
2004. 
Figure 12.3 Alexa Traffic Rank Trends for 4 Search Engines


According to Alexa users, Yahoo! and Google clearly see more traffic than AltaVista 
and Teoma. Alexa is also the company that supplies the Internet Archive (see the box on page 
21) with its regular donation of pages. 


Chapter Thirteen 


The Future of Web Information Retrieval 


Web search is a young research field with great room for growth. In this chapter, we survey 
possible directions for future research, pausing along the way for some storytelling. 


13.1 SPAM 


ASIDE: The Ghosts of Search


Sammy the Spammer had been pecking away at his computer continuously for over 
27 hours. Sammy was used to the sustained bursts of work—he’d been hacking, coding, 
programming, and spamming since he could type at the age of four. He came from a proud line 
of spammers. His older brother was a hacker, the brother before him, a hacker, and so Sammy 
naturally displayed the talent early on. The family was well known in the search engine 
optimization (SEO) community, with a reputation not too far from that of a leading mafia 
family. His family had worked hard (unethically some said) to rule the world of underground 
search rankings. If you needed to knock off a few competitors in the rankings, you came to 
Sammy. The family was well rewarded for their computer skills. Sammy himself had three 
houses—one in the Valley, Silicon, of course, one in Maui, and one in London. 


Sammy had fallen asleep at the keyboard many nights after 20-plus hour days. But this 
time when he woke up, something was different. His vision was blurry, his thoughts muddled. 
He thought he saw a thin, white-haired man dressed in a red gown, a glow about his head, 
standing ten feet in front of him. Sammy blinked twice; the man remained. He’d never seen 
the man before, yet somehow felt as if he might have. He felt strangely unalarmed by the 
apparition. Convinced he was dreaming, Sammy decided to play along with the scene, and 
asked the man, “Are you a spirit?” “I am,” came a gentle reply. “Who and what are you?” 
Sammy pried. “I am the Ghost of Search Past.” “Long past?” Sammy asked. “No, your past,” 
the spirit said. 


The ghost held Sammy’s arm as they whisked by the scenes of his past. Sammy saw 
a young boy getting an award at a science fair. Sammy remembered the project—he’d built 
a web crawler to find and connect the webpages of other young inventors on the Web. Next, 
Sammy saw his 13-year-old self sitting at a table in a pizzeria talking with his older brothers. 
They’d just taken little Sammy to his first SEO conference. The trio was buzzing about the 
financial potential of the Web. They sat making plans for making money on the Web. Wit- 
nessing that conversation now, Sammy felt the same excitement he’d felt years before. He 
was jolted back to the present. The Ghost of Search Past said, “My time grows short. Quick.” 
Sammy witnessed one last scene. Sammy and his brother sat in an office listening intently as 
their older brother took a phone call from their friend Paul. That was the conversation where 
Paul warned them to learn from his mistakes; he’d just lost a legal battle with the search engine 
Anetta, and, consequently, his business. 
Suddenly Sammy found himself back in his room. The Ghost of Search Past was gone
but another one had arrived. “You must be the Ghost of Search Present,” Sammy said. “Yes,
but we haven’t any time for me; we need to move this story along,” the spirit said. There 
was a flash, darkness, then when Sammy could see again, he found himself standing in a 
cemetery next to a shrouded, dark spirit, the Ghost of Search Future. The ghost pointed at 
a headstone—PageRank 1998-2006. “What happened?” Sammy asked. “PageRank ruled 
search. Before PageRank, web search was elementary. That algorithm changed everything. I 
did all my projects from my keyboard, I hardly had to leave my room, thanks to PageRank. 
What happened?” Sammy asked again. The spirit handed him a PDA. On it was an obituary. 


Obituary: Born in 1998, PageRank is survived by parents Larry Page and Sergey
Brin. Died on November 27, 2006. After a long, hard-fought battle with link spam- 
mers, PageRank finally succumbed to ... 


The PDA slipped from Sammy’s hand as he blankly turned to the spirit. “Tell me truly, 
Spirit, did I do this? Could I have changed the course of this algorithm’s life?” There was no 
reply. Of course not. Sammy knew future spirits never spoke. Sammy slowly scanned the 
graveyard. He saw headstones for other algorithms he’d known: HITS, SALSA, TrafficRank.
It was too much at once. Sammy begged to go back; he pleaded with the silent spirit. And 
snap, back to reality. Sammy awoke in his room in front of his keyboard. 


The story, the Ghosts of Search, might not be too outlandish. In fact, it was inspired 
by a recent weblog posting. On May 24, 2003, Jeremy Zawodny declared PageRank dead. 
He claimed the algorithm was no longer useful because bloggers and SEOs had learned 
too much about it and had, in effect, changed the nature of the Web. His argument ran as
follows: PageRank is based on an optimistic assumption that all links are conceived in good
faith with no ulterior motives; that assumption no longer holds, so PageRank is no longer useful. The blog
article “PageRank is Dead” inspired many interesting rebuttals. We are certain (private 
communication) that PageRank is not dead. It’s still a major part of the Google technology, 
but just one part—new additions and refinements are constantly made. Nevertheless, while 
spam may not have killed PageRank completely, it has initiated a lot of damage control. 
In fact, spam is a major area of research for all search engines. New search engines turn 
heads when they back up claims that their algorithms are impervious to spam. 


Creating spam-resistant ranking algorithms is a current goal. But in the meantime, 
many engines settle for simply identifying spam pages, which they can then devalue sig- 
nificantly after the ranking computation. Spam identification probably isn’t any easier than 
starting from scratch, trying to create a new, spam-resistant algorithm. But it’s a route many 
engines are taking due to the personal attachment and resources they’ve already invested 
in their existing ranking algorithms. 


One algorithm for identifying link spam uses the structure of link farms and link 
exchanges, the primary means for boosting rankings, to identify pages participating in link 
spam. Specifically, the algorithm considers each page one at a time, and asks, “What pro- 
portion of this page’s outlinking pages point back to it?” (In other words, what percentage 
of a page’s links are reciprocal?) If the answer is greater than some threshold (say 80%), 
then that page is identified as a page likely to be participating in link spam. The identified 
page is then sent an email similar to the following message. 


You’ve been identified as a link spammer. Your pages will be removed from
our index unless you immediately remove all links to fellow spammers
participating in an exchange or farm program.
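The reciprocal-link test described above can be sketched in a few lines. The link data and page names below are hypothetical, the function names are ours, and the 80% threshold is the illustrative value mentioned in the text rather than a setting any engine is known to use.

```python
# Hypothetical link data: page -> set of pages it links to.
outlinks = {
    "f1": {"f2", "f3"},             # f1, f2, f3 form a small link farm
    "f2": {"f1", "f3"},
    "f3": {"f1", "f2"},
    "news": {"blog", "wiki", "f1"},
    "blog": {"news", "wiki"},
    "wiki": set(),
}

def reciprocal_fraction(outlinks, page):
    """Fraction of the pages that `page` links to which link back to it."""
    targets = outlinks.get(page, set())
    if not targets:
        return 0.0
    back = sum(1 for t in targets if page in outlinks.get(t, set()))
    return back / len(targets)

def flag_link_spam(outlinks, threshold=0.8):
    """Pages whose share of reciprocated outlinks exceeds the threshold."""
    return sorted(p for p in outlinks
                  if reciprocal_fraction(outlinks, p) > threshold)

print(flag_link_spam(outlinks))   # ['f1', 'f2', 'f3']
```

Only the three farm pages are flagged here; the ordinary pages have at most half of their outlinks reciprocated. The loop touches every page and every outlink once, which is why the computation is tedious at the scale of a full index.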


Search engine professionals cite an added bonus to the method: a spammer who has 
been caught often rats on his fellow spammers participating in the same exchange program 
to make sure they don’t reap the benefits of his lost business. Noticing which links the 
identified spammer then removes also helps identify other potential spammers. However, 
this simple algorithm has a few drawbacks. First, it’s a tedious computation that must be 
done for each page in the index. Second, it’s not foolproof. Consider the following email 
invitation that we received from a smarter link spammer (or perhaps one who’d been caught 
once before). 


Hello, 

We offer accommodation services and I thought you might be interested in link 
exchange. We provide several travel-related sites. All of them are PageRank 6. 

Due to the possible harming nature of too many reciprocal links we suggest non 
reciprocal links. You can link to us from your site and we will link back from another 
of our sites. 

If you got this message in error please forward this mail to your webmaster. 

I look forward to hearing from you. 

Best Regards, Mark 


Another idea for deterring link spam is to build a score that is the “opposite” of 
PageRank. It’s called BadRank (http://pr.efactory.de/). PageRank is a measure of
how good a page is, as measured by the quality of pages that point to it. Since goodness 
does not mean the absence of badness, we can also give every page a BadRank score 
that measures how bad the page is. The BadRank thesis is: a page is bad if it points to 
other bad pages. BadRank propagates along outlinks, whereas PageRank propagates along
inlinks. PageRank and BadRank can be combined to give an overall fitness score to each 
page. Andrei Broder and his IBM colleagues presented a similar idea [15] at the 2004 
World Wide Web conference in New York City. Their method creates a PageRank-like 
algorithm for penalizing pages that point to dead pages, which are abandoned sites. 
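
The BadRank thesis can be illustrated with a short Python sketch. The text above does not give a formula, so the update rule here is an assumption chosen to mirror PageRank on the reversed graph: a page inherits badness from the pages it points to, each target's badness is shared among its inlinking pages, and a hypothetical seed set of known bad pages supplies the starting badness. The function name badrank, the damping value d = 0.85, and the toy graph are all invented for illustration.

```python
def badrank(outlinks, seeds, d=0.85, iterations=50):
    """Propagate badness backward along links (assumed formulation).

    score(p) = (1 - d) * seed(p) + d * sum(score(t) / indegree(t))
    where the sum runs over the pages t that p links to.
    """
    pages = set(outlinks) | {t for ts in outlinks.values() for t in ts}
    indegree = {p: 0 for p in pages}
    for targets in outlinks.values():
        for t in targets:
            indegree[t] += 1
    base = {p: (1.0 if p in seeds else 0.0) for p in pages}
    score = dict(base)
    for _ in range(iterations):
        score = {
            p: (1 - d) * base[p]
            + d * sum(score[t] / indegree[t] for t in outlinks.get(p, ()))
            for p in pages
        }
    return score


# Hypothetical toy graph: "farm" links to a known bad page, "shill" links
# to "farm", and "clean" links only to an unrelated page.
toy_links = {
    "farm": ["spam"],
    "shill": ["farm"],
    "clean": ["news"],
    "spam": [],
    "news": [],
}
scores = badrank(toy_links, seeds={"spam"})
```

In this toy graph the badness fades with distance from the seed: the page linking directly to the bad page scores higher than the page two links away, and pages with no path to a bad page score zero.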


Some claim that, in the long run, the best spam deterrent may be the most obvious— 
simply offer search engine optimizers an alternative way to boost their rank. Rather than 
crawling their way up the rankings by haggling with competitors over link reciprocation, 
let them buy their way to the top. The price of cost-per-click advertising, which is cheap for
more specific, less popular queries, is often lower than the cost in effort and stress of
link spamming. However, since sponsored links don’t carry the authority that pure links
do in the list of results (and many users ignore them), some SEOs are willing to invest the 
time to link spam their way into the list of pure results. 


Well, there’s egg and bacon; egg sausage and bacon; egg and spam; egg bacon 
and spam; egg bacon sausage and spam; spam bacon sausage and spam; 
spam egg spam spam bacon and spam; spam sausage spam spam bacon spam 
tomato and spam.—Monty Python Spam Skit 


Like the Monty Python skit, it seems we just can’t escape spam on the Web. Spam 
has clearly become an increasingly challenging problem. In the future, we predict the best 
search engines will be the ones with entirely new ranking algorithms that were devised
from the start to handle the issues of spam. 


13.2 PERSONALIZATION 


In Section 5.3, we talked about personalized search where the motto is to let you “have it
your way” with regard to the rankings of search results. Google’s Personalized Search (in 
beta in Google Labs) lets you do this, up to the granularity provided by their check box 
categories of user interests. (See Section 5.3 and the box on page 51.) However, there’s 
a newer company that offers much more personalization. A9 (www.a9.com), an Amazon 
company, bases its search results on a simple idea. Whenever you are trying to find some- 
thing, especially something you’ve lost, try retracing your steps. A9 keeps track of your 
steps for you. The search engine automatically records a detailed history of your search 
life: pages you’ve visited, when you visited them, how often, and what queries you’ve 
attempted in the past. A9 results pages also come with a “site info” button, which contains 
statistics such as Amazon’s average traffic rank for that page, lists of customer reviews, on- 
line birth date, number of inlinks, plus Amazon’s famous recommendation system: “people 
who visited this webpage also visited ...” It seems A9 is part of a growing trend—there will 
be even more personalization for web users in the future. 


13.3 CLUSTERING 


The major search engines spend a lot of energy improving their ranking algorithms. They 
are constantly tweaking their rankings because they know that users look only at the first 20 
results. It’s important, in order to maintain user loyalty, that these be the best, most relevant 
pages. However, some newer search companies believe that only modest gains are to be 
had by these ranking refinements. No matter how hard you try, you just can’t pack more 
than 20 highly relevant pages into the top 20 results. Instead, these companies abandon 
the fixation with ranking one list and work on creating hierarchical clusterings of results. 
These clusters help users drill down and quickly find the most appropriate category. This, 
in turn, helps with query refinement, the process of submitting a slightly revised query 
based on the prior search results. Teoma, which uses the HITS-based algorithm, actually 
has a third set of results, in addition to the hub and authority lists we mentioned in Chapter 
11, called the Refine List, which contains categories associated with the query. 


Along these lines, the rising meta-search engine Vivísimo is trying to set “a new
standard for the way document collections are organized.” Vivísimo was founded in 2000
by computer scientists at Carnegie Mellon University. On the left-hand side of the results 
page are hierarchical category folders. For example, try a query on “Kerri Walsh,” the taller 
half of the May/Walsh pro beach volleyball pair, which recently won the gold medal in the 
Athens Olympic games. Vivísimo finds 153 results: 47 are grouped under the Gold category,
20 under AVP (Association of Volleyball Professionals), 8 under Youngs/McPeak
(May/Walsh’s toughest competitor), 5 under Misty (Walsh’s doubles partner, Misty May), 
and so on. The right-hand side looks like the results from a standard search engine like 
Google or AltaVista. That is, the 153 results are listed from most relevant to least relevant 
regardless of category. Vivísimo technology is not limited to the Web; they recently created
a special tool to help the media and public find information quickly in the 570+ page
report of the 9/11 Commission. (You can give it a try at http://vivisimo.com/911.)


ASIDE: KartOO Clustering


The French meta-search engine KartOO (http://www.kartoo.com/) is a really
fun tool. It’s like an artist’s rendition of the Vivísimo results. KartOO displays clustered
search results both on the left-hand side in a list as well as visually on an interactive map. 
Notice in Figure 13.1 how the results of the example query of “Kerri Walsh” brighten up. 
Figure 13.1 Sample map of KartOO results


Unfortunately, this screen shot does not allow interactivity, so you can’t see the links 
between topics and webpages that would appear as your mouse scrolls over the map. Clicking 
on any topic in the map automatically refines the query, biasing the revised results toward that 
topic. KartOO is at the other end of the spectrum of search engines. Most search engines give 
users a simple clean list of ranked results, assuming users lack time, effort, and discrimination. 
KartOO instead fills the page with as much information as possible and allows users to sort 
through the pages creating new connections as they proceed. Which type of display do users 
prefer? Depends on who you ask and when you ask them. Sometimes you’re in a hurry and 
want the search engine to do all the work, and sometimes you have the time to play around 
and discover things for yourself. 


13.4 INTELLIGENT AGENTS 


ASIDE: The First Brain Implant 


NY Times—June 30, 2009. Yesterday 36-year old Larry Page woke up from surgery 
still feeling a little groggy. His first post-surgery words were “I’m hungry.” The reporters
were hoping for something a little more prophetic, but the event itself is news enough. Just 
12 hours prior, the Google cofounder and owner, in a bold public relations move, became the 
first person to undergo a radical new surgery—the Google brain implant. 


It was only six years ago when Page, speaking of the future of search, said, “On the 
more exciting front, you can imagine your brain being augmented by Google” [135]. What 
progress for mankind in such a short time. Marjorie States already knows how she’d use a
Google implant. States loved using the GPS system in her Acura to find restaurants and 
directions around town when she lived in Poughkeepsie. Since she moved to New York City 
three months ago, she’s been walking around town or using the subway instead of the car. 
She can’t wait for a Google brain implant to replace her current on-the-go restaurant locator 
method of dialing 411. “Using 411 on the cell is so 1990s. With the Google implant I'll save 
like 5 blocks of walking and 15 minutes everyday.” You can’t put a value on time lost. 


But not all citizens are thrilled by the scientific achievement. In fact, for months a small 
but passionate group has been lobbying in D.C. for a constitutional amendment to ban brain 
implants of any sort—informational, memory, sensory, audio/visual, etc. While this group 
describes doomsday predictions of mind control, regression of analytical skills, and long-term 
memory damage, others have literally sold the family farm to secure their $200,000 spot on the 
recipient list. These implant hopefuls believe the benefits of improved test scores, increased 
job performance, and general convenience far outweigh the risks. And Dr. Jonas Smith, a 
neurosurgeon from Johns Hopkins University, puts the risks in perspective, “my feeling about 
brain implantation is that only time will tell who is right and who is dead.” Indeed, it’s a very 
scary but exciting time for science. 


While Larry Page’s vision of the future of web search is a bit far-fetched, the story is 
a good introduction to a more realistic vision—one that includes search pets and intelligent 
agents. An intelligent agent is a software robot designed to retrieve specific information 
automatically. The adjective intelligent describes the agent’s ability to run without super- 
vision and learn about your preferences based on your search history, browser cookies, etc. 
Intelligent agents exist already. Many go hunting for new postings on topics you preselect 
like the Google Web Alert (available at Google Labs). Some find the best price for an item 
you want to buy. Others collect and organize your e-mail. 


There’s a futuristic agent that Google’s Director of Technology Craig Silverstein 
calls a search pet. Most searches today are limited to facts. However, according to Sil- 
verstein, that won’t be the case in the future. Because these search pets will be able to 
understand emotions and the way humans interact, people will be able to search for things 
that aren’t necessarily facts. That’s a tall order for a search pet, since most humans have 
trouble understanding how other humans work. Nevertheless, we’ll be seeing much more 
from intelligent agents in the future. 


13.5 TRENDS AND TIME-SENSITIVE SEARCH 


ASIDE: Blogs and Trends 


In May 2004 I attended my first WWW conference, the Thirteenth International World 
Wide Web Conference, held in New York City. Not being a computer scientist, I felt a bit 
out of place as I was outnumbered at least 20 to 1 by computer scientists. During the first 
presentation, I noticed how different the WWW conference was from the SIAM (Society for
Industrial and Applied Mathematics) conferences I normally attend. How rude of my new
colleagues to check their email and surf the Web on their laptops while the speaker covered his 
material, I thought. I mentioned this fact, that over 80% of the audience (including those in the 
front row) were pecking on their keyboards during the talks, to a friend. At least inattentive 
SIAM attendees sat in the back. He explained that most of the audience members were not 
being rude, but rather very attentive. They were following along, hosting chat rooms about 
the ongoing talk, and surfing for definitions of acronyms. I was impressed with their use of 
technology. 


Here’s how one conversation with a new computer science friend went. “How much of 
a computer geek are you, Amy?” asked Urban. “Not much of one,” I said. “So you don’t have 
a blog!” “No.” “But you have read Slashdot, right?” “Never heard of it,” I said. Urban gasped. 
“The blog Slashdot: News for Nerds. Stuff that matters. Just last week they had this article on ...”


Throughout the conference, my new computer science friends gladly filled my tabula 
rasa. I soon learned enough to make up for my blogging deficits. I learned that blog (rhymes
with flog) was short for weblog, which is an interactive online diary of time-stamped entries. 
Soon I was curious to surf Salon and Slashdot, blogs with supposedly entertaining stories, 
witty political commentary, and geeky must-read news. I also learned that blogs are easy to 
start and maintain. (Anybody’s brother, with the help of software such as Blogger, Radio 
Userland, or Live Journal, can host his own blog.) I learned that most blogs have a blog roll, 
which is a list of other blogs the author recommends. Blogs contain lots of links so readers 
can follow conversations across different blogs. Blogs are often organized by threads, which 
are strings of comments on the same topic. I learned that some blogs have daily devotees, 
while most others are read by a handful of fans. I also learned that, for the most part, I
couldn’t care less about the information contained in blogs. Most blogs serve as a creative outlet for
wannabe artists, writers, poets, political commentators, and the like. Every day Uncle Pete 
in Franklin, Michigan can tell his family (and the world) what he thinks about his 1980 Ford 
truck. Despite this, a precious few blogs do contain information that serves a community’s 
needs and provides useful archival potential. That observation led me to the most important 
thing I learned about blogs all week: searching blogs is an interesting new research area. 


There are several issues when it comes to searching blogs. For example, should blog 
results be listed in the search engine’s list of results or are blogs really a different beast? Since 
most blogs contain little or no information, most people think they should not be mixed in 
with the standard search results. But blogs aren’t completely useless, either. For example, 
if you need to know how to install replacement bulbs for the headlights on your 1980 truck, 
then you’d be interested in searching the pictures and postings by Uncle Pete of Franklin, 
Michigan. Perhaps, instead, blogs should be searched in a separate domain, similar to the way 
Google News searches just within news sites. That’s the prevailing feeling because blogs really 
are different from most webpages. Blogs are updated even more frequently than ordinary 
webpages, and blogs contain a time stamp that can be very helpful in searching for time- 
sensitive information. Blogs are also link-rich and content-poor. Blogs are full of links like 
“check out this cool page” or “here’s a great article” interspersed with a sentence or two 
of commentary. This means traditional information retrieval scores have trouble identifying 
topics when pages contain so little content. But it also highlights the editorial nature of blogs. 
Blogs contain short snippets of personal opinions, shared and conflicting, whereas news sites 
contain one aggregate opinion presented by the author. These personal opinions may be very 
helpful when you’re deciding whether to buy the 10 GB iPod or the 20 GB iPod. Technorati, 
www.technorati.com, is one search engine that keeps track of the blogspace, the world
of blogs, by watching over 3 million blogs and 470 million links. For example, Technorati
tracks interesting opinions on the top books and news stories.


Eytan Adar and his colleagues at the Hewlett-Packard Information Dynamics Lab 
have created an algorithm for ranking pages in the blogspace [8]. They rank blogs by 
their so-called epidemic importance, that is, their ability to spread information quickly. 
Their algorithm, called iRank, is very similar to PageRank. But there are two essential 
differences. First, the original link graph for the blogspace, called the explicit graph, is 
augmented by what they call implicit links. An implicit link between two blogs that are 
not explicitly connected by a hyperlink is made if an implicit reference between the two 
blogs is found in the text of one of the blogs or if the text and link similarity between 
the two is high. For instance, Andy’s blog might say “Brian’s blog has an interesting post 
about the new Elmo stuffed toy. You can buy the toys at website1 or website2.” Explicit
links from Andy’s blog are made to the two online stores, but no explicit link is made to 
Brian’s blog. However, there’s a clear connection between the two blogs, and readers of 
one probably read the other. Adar’s algorithm uses text analysis to find and add these 
implicit links to the blog graph. The second distinguishing feature of iRank is its temporal 
factor. All links are weighted by their freshness. A link’s weight is inversely proportional 
to the difference in dates between the two blogs. Thus, a blog is rewarded for citing recent 
postings on another blog. At this point, ordinary PageRank is run on this augmented, time- 
weighted graph, giving an iRank vector that contains the ranking for the blogs. Adar et al. 
found that iRank results differ substantially from PageRank results. Blogs with high iRank 
tend to be portal pages or pages aimed at finding the most current information, whereas 
blogs with high PageRank usually contained original authoritative material. Depending on 
the search goal, one ranking may be more valuable than the other. 
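
The two ingredients described above, an augmented link graph and freshness weights, can be sketched as follows. This is a rough illustration and not Adar et al.'s implementation: the implicit links are assumed to have been found already by text analysis (not shown), the weight 1/(1 + gap in days) is one assumed way to make a weight inversely related to the date difference while avoiding division by zero, and the function names, the damping value 0.85, and the toy blogs are all hypothetical.

```python
def build_blog_graph(explicit_links, implicit_links):
    """Merge explicit and implicit links into one freshness-weighted graph.

    Each link is a tuple (source, target, day_gap); fresher links weigh more.
    """
    weights = {}
    for source, target, day_gap in list(explicit_links) + list(implicit_links):
        w = 1.0 / (1.0 + day_gap)  # assumed freshness weighting
        row = weights.setdefault(source, {})
        row[target] = row.get(target, 0.0) + w
    return weights


def weighted_pagerank(weights, alpha=0.85, iterations=100):
    """Ordinary PageRank by power iteration on a weighted graph."""
    nodes = sorted(set(weights) | {v for row in weights.values() for v in row})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1 - alpha) / n for u in nodes}
        for u in nodes:
            row = weights.get(u, {})
            total = sum(row.values())
            if total == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new[v] += alpha * rank[u] / n
            else:
                for v, w in row.items():
                    new[v] += alpha * rank[u] * w / total
        rank = new
    return rank


# Hypothetical data: Andy links explicitly to a store (stale, 30-day gap) and
# implicitly to Brian's blog (fresh, 1-day gap); Carl links explicitly to Brian.
explicit = [("andy", "store", 30), ("carl", "brian", 2)]
implicit = [("andy", "brian", 1)]

graph = build_blog_graph(explicit, implicit)
irank = weighted_pagerank(graph)
```

In this toy example Brian's blog outranks the store because the links pointing to it are both more numerous and fresher.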


The use of time as a discriminating factor in search is relatively new. Some informa- 
tion on the Web such as blog postings and news articles does come with an explicit time 
stamp. In other cases, time-sensitive information can be extracted implicitly. For exam- 
ple, the Internet Archive gets an approximation of dates for revisions to webpages with its 
periodic crawls. The Recall Machine from the Internet Archive as well as Google Groups 
allow search for information posted within specific time frames. This feature allows for 
very focused queries. For example, with the time-sensitive search capability, it’s easy to 
compare the tone and content of articles written within a month of the September 11, 2001 
tragedy with those written three years later. 


13.6 PRIVACY AND CENSORSHIP 


Deciding which pages to index is not as simple as it once was. In the Web’s early days, the 
goal was simply to index as many pages as possible. Now search companies must be more 
judicious. They must also consider the privacy of users. For example, spiders must care- 
fully obey all robots.txt files. Similarly, deciding which pages to retrieve for user queries 
is complicated by issues like user safety. Because children have easy access to search en- 
gines, most companies have added safe search filters to their offerings. These issues and 
the two asides below demonstrate that the leaders of search companies must be critical 
thinkers and students of the liberal arts. They routinely face philosophical, ethical, polit- 
ical, business, and legal issues far afield from their graduate studies in computer science, 
engineering, or mathematics. 
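
As a small illustration of what obeying a robots.txt file involves, the sketch below uses Python's standard urllib.robotparser module to decide whether a crawler may fetch a page. The robots.txt rules, the site www.example.com, the crawler name ExampleSpider, and the helper may_crawl are hypothetical; a real spider would download the rules from each site it visits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules supplied as a list of lines


def may_crawl(url, agent="ExampleSpider"):
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


allowed = may_crawl("http://www.example.com/index.html")
blocked = may_crawl("http://www.example.com/private/diary.html")
```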
ASIDE: Google’s Cookie


Privacy advocates think Google’s toolbar and Gmail are a nightmare. These privacy 
hounds despise Google’s “immortal” cookie, which collects the IP address, time, date, and 
search terms of Toolbar users and does not expire until 2038. The lengthy expiration date is 
evidence enough for privacy hounds that Google cannot be trusted. However, Google could 
make good use of some of the collected information. For example, they could use the IP 
address to augment their new local search service, by sorting results for some queries (such 
as those for businesses, addresses, phone numbers, etc.) by proximity to the location of the 
user’s computer. Many Toolbar users are calmed by the fact that the amount of aggregated 
data that Google collects makes individuals nearly anonymous. Especially cautious users can 
turn off some features of the Toolbar to restrict Google’s data collection if they desire. 


ASIDE: Search in China


In early September 2002, the Google homepage was inaccessible in China. A user 
entering the Google URL was rerouted to Tianwang Search, a search engine operated by 
Peking University. Google was blocked because its searches could return links to pornogra- 
phy, democratic forums, content associated with the banned spiritual movement Falun Gong, 
and information endangering national security. The Great Firewall of China, a reference to the 
government’s open attempts to control web content by blocking foreign news sites and forcing 
domestic sites to remove unwholesome content, has been in place since the birth of the Inter- 
net. However, this was the first time censors had hijacked a search engine domain name and 
rerouted traffic to another site. One week later, AltaVista was blocked as well. Apparently, the 
volume of complaints by Chinese surfers was enough to lift the block. Within a few weeks, 
access to Google and AltaVista was restored. Human rights groups have written letters to the 
CEOs of Google and AltaVista requesting that they fight the Chinese censorship. Some
search engines, Yahoo! for example, have voluntarily signed pledges in support of Chinese
censorship policies, and therefore offer a limited service in order to remain accessible. Search
engines must weigh the cost and benefits of no accessibility versus limited accessibility. 


13.7 LIBRARY CLASSIFICATION SCHEMES 


During the 20th century, libraries underwent a transformation in their classification and 
presentation of books. The Dewey decimal classification (DDC) system, introduced in
1876 by Melvil Dewey for the library at Amherst College, was revised and refined, so
that today, in its 21st edition, it is one of two popular classification systems in use. The
other system is the Library of Congress classification (LC) system. Because
these systems enjoy worldwide use—for example, the Dewey decimal system is currently 
used in over 135 countries—it’s natural to think of classifying webpages in a similar man- 
ner. Some groups are trying to encourage users to use either the DDC numbers or LC 
numbers in metatags. Yet, if a strong connection between webpages and traditional li- 
brary classification systems is to develop, the job will probably fall on the crawlers and 
indexers of web search engines. If DDC or LC numbers are associated with each webpage 
in the future, then surfers are a short link away from accessing Amazon or digital libraries 
for information on related books. 
ASIDE: Google’s Digital Library Initiative


In December of 2004, Google announced its decade-long initiative to scan millions of 
books from the collections of major research universities. Harvard, Michigan, Stanford, and 
Oxford are among the cooperating universities, as well as the nonacademic New York Public 
Library. The ultimate goal is to allow surfers to search through text in books online. For 
books still under copyright protection, only brief snippets of text and reference information 
will appear. 


However, several publishing companies are not excited about Google’s new initiative. 
These companies prefer that their books and series not be included in the collection. Scan- 
ning a book is a clear violation of copyright law, and is allowed only with permission. Most 
publishers will grant such permission; however, they just want to be asked first. In effect, the
publishers are sending Google a warning message, that the search giant needs to respect the 
rules of this long-standing profession. In the meantime, Google has a huge stack of “May I” 
permission letters that need to be signed. 


13.8 DATA FUSION 


A new type of web retrieval application based on maps is the latest technology. The Where 
2.0 conference assembles researchers and developers in location-based technology. The 
idea is to layer advanced user-friendly interactive search features on top of the familiar 
visual of a map. For example, the Swiss search engine, search.ch, which recently won the 
Best of the Swiss Web Prize, places icons of restaurants, movie theatres, bus stops, park- 
ing garages, hotels, and the like on satellite maps of Switzerland. Scrolling over an icon 
shows details, such as the number of minutes until the next bus, the number of open seats 
in the theatre, the number of open spots in a parking garage, ticket prices, and phone num- 
bers. To achieve such up-to-date information, the engine periodically crawls the associated 
websites for the relevant information. By fusing data from other sources, such as phone 
directories and restaurant guides, search.ch provides a handy visual tool. In fact, visiting 
www.map.search.ch allows you to take a virtual tour of the country. With eventual 
cell phone and PDA accessibility, travel especially will be easy. 


Chapter Fourteen 


Resources for Web Information Retrieval 


14.1 RESOURCES FOR GETTING STARTED 


If you’re a student or a researcher new to the field, you'll find these resources helpful for 
getting started. The datasets are small and manageable, the code simple, and the algorithms 
run quickly. 


14.1.1 Datasets 


There are several small web graphs that are available for download. The table below pro- 
vides details. 


Table 14.1 Small web graphs 


Dataset       # pages   # links   Available at
movies            451       713   website 1
censorship        562       736   website 1
abortion         1693      4325   website 1
genetics         2952      6485   website 1
EPA              4772      8965   website 2
Hollins          6012     23875   website 3
California       9664     16150   website 2


Most of these webpages also contain other graphs that are similar in size and source. 
For example, Panayiotis Tsaparas hosts a nice webpage (website 4) that contains more 
graphs (and some C code). 


Website 1: http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html

Website 2: http://www.cs.cornell.edu/Courses/cs685/2002fa/

Website 3: http://www.math.vt.edu/people/kemassey/ir/

Website 4: http://www.cs.toronto.edu/~tsap/experiments/download/download.html


14.1.2 Crawlers


On page 17, we provided Cleve Moler’s Matlab code for creating your own datasets. This 
m-file can be downloaded from the website for Cleve’s new book Numerical Computing 
with Matlab, http://www.mathworks.com/moler/ncemfilelist.html. This m-file
routine can be used to create small, tailored datasets. However, it can be slow and it does 
have some documented problems, e.g., it can stall waiting to download pages with images 
or data files. 


14.1.3 Code 


Matlab is a great tool for programming algorithms and testing ideas on reasonably sized 
datasets. This book contains Matlab code for many algorithms, such as the PageRank 
and HITS algorithms. Other programmers have also produced Matlab code for these link 
analysis problems. See, for example, the following websites: 


• http://www.stanford.edu/~sdkamvar/research.html#Data
• http://math.cofc.edu/~langvillea/PRDataCode/index.html


14.1.4 References 


Extensive lists of references, some hyperlinked, are available at: 


• http://www.cs.cornell.edu/Courses/cs685/2002fa/
• http://linkanalysis.wlv.ac.uk/
• http://math.cofc.edu/~langvillea/#Current%20Research


In addition, each year the World Wide Web Conference has several papers related to link 
analysis. 


14.2 RESOURCES FOR SERIOUS STUDY 


When you are ready to move on to bigger problems, consider the tools cited in this section. 


14.2.1 Datasets


Much larger datasets are available for those interested in more serious study of link analysis.
Table 14.2 gives information for some representative datasets.


Website 5: http://www.stanford.edu/~sdkamvar/research.html#Data
Website 6: http://cybermetrics.wlv.ac.uk/database/


Table 14.2 Large web graphs


Dataset                          # pages        # links        Available at
Stanford University sites        0.28 million   2.3 million    website 5
Stanford-Berkeley sites          0.68 million   7.6 million    website 5
23 U.S. University sites         3.0 million    23.9 million   website 6
38 Australian University sites   2.3 million    19.8 million   website 6


14.2.2 Crawlers 


There are several nice tools for crawling and collecting link information for large datasets. 
For example, try the following tools: 


• SocSciBot3: http://socscibot.wlv.ac.uk/
• WebBot: information and directions for downloading are available at
  http://www.math.vt.edu/people/kemassey/ir/
• Stanford WebBase Project: http://www-diglib.stanford.edu/~testbed/doc2/WebBase/
• WebGraph Graph Compression Tools: http://webgraph.dsi.unimi.it/


14.2.3 Code 


Any serious study of algorithms, one aimed at creating production code, must implement
algorithms in Fortran, C, or C++ rather than a more user-friendly but high-level
language such as Matlab. In order to compute ranking vectors, many of the link analy- 
sis methods in this book use classic numerical algorithms. Fortunately, effective, efficient 
code is readily available for such classic algorithms. For example, the Netlib repository 
(http://www.netlib.org/) contains various implementations of the power method or
other eigenvector methods written in several of the most popular programming languages. 
Chapter Fifteen
The Mathematics Guide 


Appreciating the subtleties of PageRank, HITS, and other ranking schemes requires knowl- 
edge of some mathematical concepts. In particular, it’s necessary to understand some as- 
pects of linear algebra, discrete Markov chains, and graph theory. Rather than presenting a 
comprehensive survey of these areas, our purpose here is to touch on only the most relevant 
topics that arise in the mathematical analysis of Web search concepts. Technical proofs are 
generally omitted. 


The common ground is linear algebra, so this is where we start. The reader who
wants more detail or simply wants to review elementary linear algebra to an extent greater 
than that given here should consult [127]. 


15.1 LINEAR ALGEBRA 


In the context of Web search the matrices encountered are almost always real, but because 
real matrices can generate complex numbers (e.g., eigenvalues) it’s often necessary to con- 
sider complex numbers, vectors, and matrices. Throughout this chapter real numbers, real 
vectors, and real matrices are respectively denoted by $\mathbb{R}$, $\mathbb{R}^n$, and $\mathbb{R}^{m \times n}$, while complex
numbers, vectors, and matrices are respectively denoted by $\mathbb{C}$, $\mathbb{C}^n$, and $\mathbb{C}^{m \times n}$. The following
basic concepts arise in the mathematical analysis of Web search problems.


Norms 


The most common way to measure the magnitude of a row (or column) vector $\mathbf{x} =
(x_1, x_2, \ldots, x_n)$ of real or complex numbers is by means of the euclidean norm (sometimes
called the 2-norm) that is defined by

\[
\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2}.
\]


However, in the applications involving PageRank and Markov chains, it’s more natural
(and convenient) to use the vector 1-norm defined by

\[
\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|
\]

because, for example, if $\mathbf{p}$ is a PageRank (or probability) vector (i.e., a nonnegative vector
with components summing to one), then $\|\mathbf{p}\|_1 = 1$. Occasionally the vector $\infty$-norm

\[
\|\mathbf{x}\|_\infty = \max_i |x_i|
\]

is used. All norms satisfy the three properties

\[
\begin{aligned}
&\|\mathbf{x}\| \ge 0, \text{ where } \|\mathbf{x}\| = 0 \text{ if and only if } \mathbf{x} = \mathbf{0}, \\
&\|\alpha \mathbf{x}\| = |\alpha| \, \|\mathbf{x}\| \text{ for all scalars } \alpha, \\
&\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\| \text{ (the triangle inequality).}
\end{aligned}
\]


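Associated with each vector norm is an induced matrix norm. If $\mathbf{A}$ is $m \times n$ and $\mathbf{x}$
is $n \times 1$, and if $\|\cdot\|_\star$ is any vector norm, then the corresponding induced matrix norm is
defined to be

\[
\|\mathbf{A}\|_\star = \max_{\|\mathbf{x}\|_\star = 1} \|\mathbf{A}\mathbf{x}\|_\star.
\]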


The respective matrix norms that are induced by the 1-, 2-, and ∞-vector norms are

$$\|A\|_1 = \max_j \sum_i |a_{ij}| = \text{the largest absolute column sum},$$

$$\|A\|_2 = \sqrt{\lambda_{\max}}, \ \text{ where } \lambda_{\max} = \text{largest eigenvalue of } A^TA$$

(replace transpose by conjugate transpose if A is complex),

$$\|A\|_\infty = \max_i \sum_j |a_{ij}| = \text{the largest absolute row sum}.$$


The details surrounding these properties can be found in [127]. 


The nice thing about induced matrix norms is that each of them is compatible with 
its corresponding vector norm in the sense that 


$$\|Ax\|_\star \le \|A\|_\star\,\|x\|_\star.$$


However, this compatibility condition holds only for right-hand matrix-vector multiplication.
For left-hand vector-matrix multiplication, which is common in Markov chain applications,
transposition is needed to convert back to right-hand matrix-vector multiplication,
and this results in different compatibility rules. If $x^T$ is 1 × m and A is m × n, then

$$\|x^TA\|_1 \le \|x^T\|_1\,\|A\|_\infty \quad\text{and}\quad \|x^TA\|_\infty \le \|x^T\|_\infty\,\|A\|_1.$$
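The norms and the two sets of compatibility rules can be checked numerically. The following is a minimal Python sketch, not code from the text: the function names, the matrix A, and the vectors x and y are made-up values chosen only for illustration.

```python
def vec_norm_1(x):
    return sum(abs(v) for v in x)

def vec_norm_inf(x):
    return max(abs(v) for v in x)

def mat_norm_1(A):
    # largest absolute column sum
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def mat_norm_inf(A):
    # largest absolute row sum
    return max(sum(abs(a) for a in row) for row in A)

def mat_vec(A, x):
    # right-hand product A x
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def vec_mat(y, A):
    # left-hand product y^T A, with (y^T A)_j = sum_i y_i a_ij
    return [sum(y[i] * A[i][j] for i in range(len(A))) for j in range(len(A[0]))]

A = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0]]   # hypothetical 2 x 3 matrix
x = [1.0, -1.0, 2.0]     # column vector used on the right
y = [0.25, 0.75]         # probability vector used on the left

Ax = mat_vec(A, x)
yA = vec_mat(y, A)

# right-hand rules: each induced norm pairs with its own vector norm
right_1 = vec_norm_1(Ax) <= mat_norm_1(A) * vec_norm_1(x)
right_inf = vec_norm_inf(Ax) <= mat_norm_inf(A) * vec_norm_inf(x)

# left-hand rules: the 1-norm pairs with the matrix inf-norm and vice versa
left_1 = vec_norm_1(yA) <= vec_norm_1(y) * mat_norm_inf(A)
left_inf = vec_norm_inf(yA) <= vec_norm_inf(y) * mat_norm_1(A)

print(right_1, right_inf, left_1, left_inf)
```

Note that the probability vector y has 1-norm equal to one, which is the reason the 1-norm is convenient in Markov chain work.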


Sensitivity of Linear Systems 


It is assumed that the reader is familiar with Gaussian elimination methods for solving a
system $A_{m\times n}x_{n\times 1} = b_{m\times 1}$ of m linear equations in n unknowns. If not, read [127].
Algorithms for solving Ax = b are important, but the sensitivity of a solution to
small uncertainties or perturbations in the coefficients is particularly relevant, especially in
light of the fact that the PageRank vector is the solution to a particular linear system.


While greater generality is possible, it suffices to consider a square nonsingular sys- 
tem Ax = b in which both A and b are subject to uncertainties that might be the result 
of modeling error, numerical round-off error, measurement error, or small perturbations of 
any kind. How much uncertainty (or sensitivity) can the solution $x = A^{-1}b$ exhibit?


An answer is provided by using calculus. Consider the entries of A = A(t) and
b = b(t) to vary with a differentiable parameter t, and compute the relative size of the
derivative of x = x(t) by differentiating b = Ax to obtain $b' = (Ax)' = A'x + Ax'$
(with $x'$ denoting $dx/dt$). Solving for $x'$ and taking norms (the choice of norm is not
important) yields

$$\|x'\| = \|A^{-1}b' - A^{-1}A'x\| \le \|A^{-1}b'\| + \|A^{-1}A'x\| \le \|A^{-1}\|\,\|b'\| + \|A^{-1}\|\,\|A'\|\,\|x\|.$$

Consequently, because $\|b\| = \|Ax\| \le \|A\|\,\|x\|$,

$$\begin{aligned}
\frac{\|x'\|}{\|x\|} &\le \frac{\|A^{-1}\|\,\|b'\|}{\|x\|} + \|A^{-1}\|\,\|A'\| \\
&= \|A\|\,\|A^{-1}\|\,\frac{\|b'\|}{\|A\|\,\|x\|} + \|A\|\,\|A^{-1}\|\,\frac{\|A'\|}{\|A\|} \\
&\le \kappa\,\frac{\|b'\|}{\|b\|} + \kappa\,\frac{\|A'\|}{\|A\|} = \kappa\left(\frac{\|b'\|}{\|b\|} + \frac{\|A'\|}{\|A\|}\right),
\end{aligned}$$

where $\kappa = \|A\|\,\|A^{-1}\|$. The terms $\|x'\|/\|x\|$, $\|b'\|/\|b\|$, and $\|A'\|/\|A\|$ represent the
respective relative sensitivities of x, b, and A to small changes. Because κ represents a
magnification of the sum of the relative sensitivities in b and A, κ is called a condition
number for A. The situation can be summarized as follows.


Sensitivity of Linear Systems 
For a nonsingular system Ax = b, the relative sensitivity of x to uncertainties or 
perturbations in A and b is never more than the sum of the relative changes in A 
and b magnified by the condition number $\kappa = \|A\|\,\|A^{-1}\|$.


A Practical Rule of Thumb. If Gaussian elimination with partial pivoting is used to solve 
a well-scaled (row norms in A are approximately one) nonsingular system Ax = b using 
t-digit floating-point arithmetic, and if κ is of order $10^p$, then, assuming no other source
of error exists, the computed solution can be expected to be accurate to at least t — p 
significant digits, more or less. In other words, one expects to lose roughly p significant 
figures. This doesn’t preclude the possibility of getting lucky and attaining a higher degree 
of accuracy—it just says that you shouldn’t bet the farm on it. 
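To see the magnification by κ at work, here is a small Python sketch using exact rational arithmetic. The nearly singular 2 × 2 matrix, the right-hand sides, and all function names are hypothetical choices for illustration. Only b is perturbed, in which case the bound $\|\Delta x\|/\|x\| \le \kappa\,\|\Delta b\|/\|b\|$ holds for finite perturbations as well.

```python
from fractions import Fraction as F

def inverse_2x2(A):
    """Exact inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def norm_inf_matrix(A):
    return max(sum(abs(v) for v in row) for row in A)

def norm_inf_vector(x):
    return max(abs(v) for v in x)

def solve(A, b):
    return [sum(a * v for a, v in zip(row, b)) for row in inverse_2x2(A)]

# hypothetical ill-conditioned system: the two rows of A are almost equal
A = [[F(1), F(1)], [F(1), F(10001, 10000)]]
b = [F(2), F(20001, 10000)]
b_perturbed = [F(2), F(20002, 10000)]   # one entry changed by 0.0001

kappa = norm_inf_matrix(A) * norm_inf_matrix(inverse_2x2(A))
x = solve(A, b)
x_perturbed = solve(A, b_perturbed)

rel_change_b = (norm_inf_vector([u - v for u, v in zip(b_perturbed, b)])
                / norm_inf_vector(b))
rel_change_x = (norm_inf_vector([u - v for u, v in zip(x_perturbed, x)])
                / norm_inf_vector(x))

print("condition number:", float(kappa))
print("relative change in b:", float(rel_change_b))
print("relative change in x:", float(rel_change_x))
print("bound kappa * rel_change_b:", float(kappa * rel_change_b))
```

A tiny relative change in b moves the solution from (1, 1) to (0, 2), which is consistent with a condition number of order $10^4$.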


Rank-One Updates 


Suppose that $A \in \mathbb{R}^{n\times n}$ is the coefficient matrix of a nonsingular system Ax = b that
contains information that periodically requires updating, and each time new information
is received, the system must be re-solved. Rather than starting from scratch each time, it
makes sense to try to perturb the solution from the previous period in a simple but
predictable way. Theoretically, the solution is always $x = A^{-1}b$, so the problem of updating
the solution to a linear system is equivalent to the problem of updating the inverse
matrix $A^{-1}$. If the new information can be formatted as a rank-one matrix $cd^T$, where
$c, d \in \mathbb{R}^{n\times 1}$, then there is a formula for updating $A^{-1}$.

Sherman–Morrison Rank-One Updating Formula

If $A_{n\times n}$ is nonsingular and if c and d are columns such that $1 + d^TA^{-1}c \ne 0$,
then the sum $A + cd^T$ is nonsingular, and

$$(A + cd^T)^{-1} = A^{-1} - \frac{A^{-1}cd^TA^{-1}}{1 + d^TA^{-1}c}. \tag{15.1.1}$$


The Sherman–Morrison formula makes it clear that when a nonsingular system
Ax = b is updated to produce another nonsingular system $(A + cd^T)z = b$, where
$b, c, d \in \mathbb{R}^{n\times 1}$, the solution of the updated system is

$$z = (A + cd^T)^{-1}b = \left(A^{-1} - \frac{A^{-1}cd^TA^{-1}}{1 + d^TA^{-1}c}\right)b
= A^{-1}b - \frac{A^{-1}cd^TA^{-1}b}{1 + d^TA^{-1}c}
= x - \frac{A^{-1}cd^Tx}{1 + d^TA^{-1}c}.$$
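The update can be sketched in a few lines of Python. This is an illustrative sketch with exact fractions, not an implementation from the text: the function names and the 2 × 2 data are assumptions, and the 2 × 2 inverse is recomputed from scratch only to confirm the formula.

```python
from fractions import Fraction as F

def inverse_2x2(A):
    """Exact inverse of a nonsingular 2 x 2 matrix, used only as a reference."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def sherman_morrison(A_inv, c, d):
    """Return (A + c d^T)^{-1} given A^{-1}, following formula (15.1.1)."""
    n = len(A_inv)
    A_inv_c = [sum(A_inv[i][k] * c[k] for k in range(n)) for i in range(n)]
    dT_A_inv = [sum(d[k] * A_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1 + sum(d[k] * A_inv_c[k] for k in range(n))
    if denom == 0:
        raise ValueError("1 + d^T A^{-1} c = 0, so A + c d^T is singular")
    return [[A_inv[i][j] - A_inv_c[i] * dT_A_inv[j] / denom for j in range(n)]
            for i in range(n)]

# hypothetical data
A = [[F(2), F(1)], [F(1), F(3)]]
b = [F(1), F(2)]
c = [F(1), F(0)]
d = [F(0), F(1)]          # c d^T adds 1 to the (1, 2) entry of A

A_inv = inverse_2x2(A)
B = [[A[i][j] + c[i] * d[j] for j in range(2)] for i in range(2)]
B_inv = sherman_morrison(A_inv, c, d)

# updated solution z = x - A^{-1} c d^T x / (1 + d^T A^{-1} c)
x = [sum(A_inv[i][k] * b[k] for k in range(2)) for i in range(2)]
A_inv_c = [sum(A_inv[i][k] * c[k] for k in range(2)) for i in range(2)]
dT_x = sum(d[k] * x[k] for k in range(2))
denom = 1 + sum(d[k] * A_inv_c[k] for k in range(2))
z = [x[i] - A_inv_c[i] * dT_x / denom for i in range(2)]

print(B_inv == inverse_2x2(B), z)
```

The point of the formula is that only matrix-vector products with the already known $A^{-1}$ are needed, so the update costs far less than a fresh inversion when n is large.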


The Sherman–Morrison formula is particularly useful when an update involves only
one row or column of A. For example, suppose that only the $i^{th}$ row of A is affected,
say row $A_{i*}$ is updated to become $B_{i*}$, and let $\epsilon^T = B_{i*} - A_{i*}$. If $e_i$ denotes the $i^{th}$ unit
column (the $i^{th}$ column of the identity matrix I), then the updated matrix can be written as

$$B = A + e_i\epsilon^T,$$

so that $e_i$ plays the role of c in (15.1.1), and $A^{-1}c = A^{-1}e_i = [A^{-1}]_{*i}$, the $i^{th}$ column
of $A^{-1}$. Consequently, $B^{-1}$ can be constructed directly from the entries in $A^{-1}$ and the
perturbation vector $\epsilon^T$ by writing

$$B^{-1} = (A + e_i\epsilon^T)^{-1} = A^{-1} - \frac{[A^{-1}]_{*i}\,\epsilon^TA^{-1}}{1 + \epsilon^T[A^{-1}]_{*i}}.$$


Eigenvalues and Eigenvectors 


For a matrix $A \in \mathbb{C}^{n\times n}$, the scalars λ and the vectors $x_{n\times 1} \ne 0$ satisfying $Ax = \lambda x$
are the respective eigenvalues and eigenvectors for A. A row vector $y^*$ is a left-hand
eigenvector if $y^*A = \lambda y^*$.


The set σ(A) of distinct eigenvalues is called the spectrum of A, and the spectral
radius of A is the nonnegative number

$$\rho(A) = \max_{\lambda \in \sigma(A)} |\lambda|.$$

The circle in the complex plane that is centered at the origin and has radius ρ(A) is called
the spectral circle, and it is a straightforward exercise to verify that

$$\rho(A) \le \|A\| \tag{15.1.2}$$

for all matrix norms.


The eigenvalues of $A_{n\times n}$ are the roots of the characteristic polynomial
$p(\lambda) = \det(A - \lambda I)$, where det(⋆) denotes determinant. The degree of p(λ) is n, so, altogether,
A has n eigenvalues, but some may be complex numbers (even if the entries of A are real
numbers), and some eigenvalues may be repeated. If A contains only real numbers, then its
complex eigenvalues must occur in conjugate pairs; i.e., if $\lambda \in \sigma(A)$, then $\bar{\lambda} \in \sigma(A)$.


The algebraic multiplicity of an eigenvalue λ of A is the number of times that λ is
repeated as a root of the characteristic equation. If $\text{alg mult}_A(\lambda) = 1$, then λ is said to be
a simple eigenvalue.


The geometric multiplicity of an eigenvalue λ of A is the number of linearly independent
eigenvectors that are associated with λ. In more formal terms,
$\text{geo mult}_A(\lambda) = \dim N(A - \lambda I)$, where N(⋆) denotes the nullspace or kernel of a matrix. It is
always the case that $\text{geo mult}_A(\lambda) \le \text{alg mult}_A(\lambda)$. If $\text{geo mult}_A(\lambda) = \text{alg mult}_A(\lambda)$, then λ is
said to be a semisimple eigenvalue.


The index of an eigenvalue $\lambda \in \sigma(A)$ is defined to be the smallest positive integer
k such that $\text{rank}((A - \lambda I)^k) = \text{rank}((A - \lambda I)^{k+1})$. It is understood that index(λ) = 0
when $\lambda \notin \sigma(A)$.


There are several different ways to characterize index. For $\lambda \in \sigma(A_{n\times n})$, saying
that k = index(λ) is equivalent to saying that k is the smallest positive integer such that
any of the following statements hold.

• $R((A - \lambda I)^k) = R((A - \lambda I)^{k+1})$, where R(⋆) denotes range.
• $N((A - \lambda I)^k) = N((A - \lambda I)^{k+1})$, where N(⋆) denotes nullspace (or kernel).
• $R((A - \lambda I)^k) \cap N((A - \lambda I)^k) = 0$.
• $\mathbb{C}^n = R((A - \lambda I)^k) \oplus N((A - \lambda I)^k)$, where ⊕ denotes direct sum.


The Jordan Form 


Eigenvalues and eigenvectors are for matrices what DNA is for biological entities, and the 
Jordan form for a square matrix A completely characterizes the eigenstructure of A. The 
theoretical basis for why the Jordan form looks as it does is somewhat involved, but the 
“form” itself is easy to understand, and that’s all you need to deal with the issues that arise 
in understanding Web searching concepts. 


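Given a matrix $A_{n\times n}$, a Jordan block associated with an eigenvalue $\lambda \in \sigma(A)$ is
defined to be a matrix of the form

$$J_\star(\lambda) = \begin{pmatrix} \lambda & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda \end{pmatrix}. \tag{15.1.3}$$

A Jordan segment J(λ) associated with $\lambda \in \sigma(A)$ is defined to be a block-diagonal matrix
containing one or more Jordan blocks. In other words, a Jordan segment looks like

$$J(\lambda) = \begin{pmatrix} J_1(\lambda) & 0 & \cdots & 0 \\ 0 & J_2(\lambda) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & J_t(\lambda) \end{pmatrix} \ \text{ with each } J_\star(\lambda) \text{ being a Jordan block.}$$

The Jordan canonical form (or simply the Jordan form) for A is a block-diagonal matrix
composed of the Jordan segments for each distinct eigenvalue. In other words, if
$\sigma(A) = \{\lambda_1, \lambda_2, \ldots, \lambda_s\}$, then the Jordan form for A is

$$J = \begin{pmatrix} J(\lambda_1) & 0 & \cdots & 0 \\ 0 & J(\lambda_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & J(\lambda_s) \end{pmatrix}. \tag{15.1.4}$$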


There is only one Jordan segment for each eigenvalue, but each segment can contain several 
Jordan blocks of varying size. The formula that governs the sizes and numbers of Jordan 
blocks is given in the following complete statement concerning the Jordan form. 


Jordan’s Theorem 
For every $A \in \mathbb{C}^{n\times n}$ there is a nonsingular matrix P such that

$$P^{-1}AP = J \tag{15.1.5}$$

is the Jordan form (15.1.4) that is characterized by the following features.

• J contains one Jordan segment J(λ) for each distinct eigenvalue $\lambda \in \sigma(A)$.
• Each segment J(λ) contains $t = \dim N(A - \lambda I)$ Jordan blocks.
• The number of i × i Jordan blocks in J(λ) is given by
  $\nu_i(\lambda) = r_{i-1}(\lambda) - 2r_i(\lambda) + r_{i+1}(\lambda)$, where $r_i(\lambda) = \text{rank}((A - \lambda I)^i)$.
• The largest Jordan block in each segment J(λ) is k × k, where k = index(λ).


The structure of J is unique in the sense that the number and sizes of the Jordan
blocks in each segment are uniquely determined by the entries in A. Two n × n matrices A
and B are similar (i.e., $B = Q^{-1}AQ$ for some nonsingular Q) if and only if A and B
have the same Jordan form.
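The rank formula in Jordan’s theorem is easy to try on a small case. The sketch below is an illustration only: the helper names are hypothetical, ranks are computed by exact Gaussian elimination over the rationals, and the test matrix is deliberately already in Jordan form (a 2 × 2 block and a 1 × 1 block for λ = 2, and a 1 × 1 block for λ = 5) so that the answer can be read off by eye.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def rank(M):
    """Rank by exact Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    r = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def jordan_structure(A, lam):
    """Return ({block size: count}, index) for the value lam."""
    n = len(A)
    shifted = [[A[i][j] - (lam if i == j else 0) for j in range(n)] for i in range(n)]
    ranks = [n]                                   # r_0 = rank(I) = n
    power = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(n + 1):
        power = matmul(power, shifted)            # (A - lam I)^i
        ranks.append(rank(power))
    # nu_i = r_{i-1} - 2 r_i + r_{i+1}
    counts = {i: ranks[i - 1] - 2 * ranks[i] + ranks[i + 1] for i in range(1, n + 1)}
    counts = {size: nu for size, nu in counts.items() if nu > 0}
    # smallest k with r_k = r_{k+1}; this gives 0 when lam is not an eigenvalue
    index = next(k for k in range(n + 1) if ranks[k] == ranks[k + 1])
    return counts, index

A = [[2, 1, 0, 0],
     [0, 2, 0, 0],
     [0, 0, 2, 0],
     [0, 0, 0, 5]]

print(jordan_structure(A, 2))
print(jordan_structure(A, 5))
```

Because the counts depend only on ranks of powers of A − λI, the same function applies unchanged to any matrix similar to this one.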


The matrix P in (15.1.5) is not unique, but its columns always form Jordan chains
(or generalized eigenvectors) in the following sense. For each Jordan block $J_\star(\lambda)$, there is
a set of columns $P_\star$ of corresponding size and position in $P = [\,\cdots\,|\,P_\star\,|\,\cdots\,]$ such that

$$P_\star = \left[\,(A - \lambda I)^i x_\star \;\middle|\; (A - \lambda I)^{i-1} x_\star \;\middle|\; \cdots \;\middle|\; (A - \lambda I)x_\star \;\middle|\; x_\star\,\right]_{n\times(i+1)}$$

for some i and some $x_\star$, where $(A - \lambda I)^i x_\star$ is a particular eigenvector associated with
λ. Formulas exist for determining i and $x_\star$ [127, p. 594], but the computations can be
complicated. Fortunately, we rarely need to compute P.


An important corollary of Jordan’s theorem (15.1.5) is the following statement
concerning the diagonalizability of a square matrix.

Diagonalizability
Each of the following statements is equivalent to saying that $A \in \mathbb{C}^{n\times n}$ is similar
to a diagonal matrix; i.e., J is diagonal (all Jordan blocks are 1 × 1).

• index(λ) = 1 for each $\lambda \in \sigma(A)$ (i.e., every eigenvalue is semisimple).
• $\text{alg mult}_A(\lambda) = \text{geo mult}_A(\lambda)$ for each $\lambda \in \sigma(A)$.
• A has a complete set of n linearly independent eigenvectors (i.e., each column
  of P is an eigenvector for A).


Functions of a Matrix 


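An important use of the Jordan form is to define functions of $A \in \mathbb{C}^{n\times n}$. That is, given
a function $f : \mathbb{C} \to \mathbb{C}$, what should f(A) mean? The answer is straightforward. Suppose
that $A = PJP^{-1}$, where $J = \begin{pmatrix} \ddots & & \\ & J_\star & \\ & & \ddots \end{pmatrix}$ is in Jordan form with the $J_\star$’s representing
the Jordan blocks described in (15.1.3). It’s natural to define the value of f at A to be

$$f(A) = Pf(J)P^{-1} = P\begin{pmatrix} \ddots & & \\ & f(J_\star) & \\ & & \ddots \end{pmatrix}P^{-1}, \tag{15.1.6}$$

but the trick is correctly defining $f(J_\star)$. It turns out that the right way to do this is by
setting, for a k × k block,

$$f(J_\star) = f\begin{pmatrix} \lambda & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda \end{pmatrix}
= \begin{pmatrix}
f(\lambda) & f'(\lambda) & \dfrac{f''(\lambda)}{2!} & \cdots & \dfrac{f^{(k-1)}(\lambda)}{(k-1)!} \\
 & f(\lambda) & f'(\lambda) & \ddots & \vdots \\
 & & \ddots & \ddots & \dfrac{f''(\lambda)}{2!} \\
 & & & f(\lambda) & f'(\lambda) \\
 & & & & f(\lambda)
\end{pmatrix}. \tag{15.1.7}$$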

Matrix Functions 


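Let $A \in \mathbb{C}^{n\times n}$ with $\sigma(A) = \{\lambda_1, \lambda_2, \ldots, \lambda_s\}$, and let $f : \mathbb{C} \to \mathbb{C}$ be such that
$f(\lambda_i), f'(\lambda_i), \ldots, f^{(k_i-1)}(\lambda_i)$ exist for each i, where $k_i = \text{index}(\lambda_i)$. Define

$$f(A) = Pf(J)P^{-1} = P\begin{pmatrix} \ddots & & \\ & f(J_\star) & \\ & & \ddots \end{pmatrix}P^{-1}, \tag{15.1.8}$$

where J is the Jordan form for A and $f(J_\star)$ is given by (15.1.7).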


There are at least two other equivalent and useful ways to view functions of matrices.
The first of these is called the spectral theorem for matrix functions, and this arises by
expanding the product on the right-hand side of (15.1.8) to yield the following.

Spectral Theorem for General Matrices
If $A \in \mathbb{C}^{n\times n}$ with $\sigma(A) = \{\lambda_1, \lambda_2, \ldots, \lambda_s\}$, then

$$f(A) = \sum_{i=1}^{s}\sum_{j=0}^{k_i-1} \frac{f^{(j)}(\lambda_i)}{j!}\,(A - \lambda_iI)^j\,G_i, \tag{15.1.9}$$

where each $G_i$ has the following properties.
• $G_i$ is a projector (i.e., $G_i^2 = G_i$) onto $N((A - \lambda_iI)^{k_i})$ along $R((A - \lambda_iI)^{k_i})$.
• $G_1 + G_2 + \cdots + G_s = I$.
• $G_iG_j = 0$ when i ≠ j.
• $(A - \lambda_iI)G_i = G_i(A - \lambda_iI)$ is nilpotent of index $k_i$.


The $G_i$’s are called the spectral projectors associated with matrix A.


Another useful way to deal with functions of a matrix is by means of infinite series. 


Infinite Series Representations 
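If $\sum_{j=0}^{\infty} c_j(z - z_0)^j$ converges to f(z) at each point inside a circle $|z - z_0| = r$, and if
$|\lambda - z_0| < r$ for each eigenvalue $\lambda \in \sigma(A)$, then $\sum_{j=0}^{\infty} c_j(A - z_0I)^j$ converges,
and

$$f(A) = \sum_{j=0}^{\infty} c_j(A - z_0I)^j.$$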


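If A is diagonalizable, i.e., if A is similar to a diagonal matrix

$$A = P\begin{pmatrix} \lambda_1I & 0 & \cdots & 0 \\ 0 & \lambda_2I & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_sI \end{pmatrix}P^{-1},$$

then

$$f(A) = P\begin{pmatrix} f(\lambda_1)I & 0 & \cdots & 0 \\ 0 & f(\lambda_2)I & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f(\lambda_s)I \end{pmatrix}P^{-1},$$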


and formula (15.1.9) yields the following spectral theorem for diagonalizable matrices.

Spectral Theorem for Diagonalizable Matrices
If A is diagonalizable with $\sigma(A) = \{\lambda_1, \lambda_2, \ldots, \lambda_s\}$, then

$$A = \lambda_1G_1 + \lambda_2G_2 + \cdots + \lambda_sG_s, \tag{15.1.10}$$

and

$$f(A) = f(\lambda_1)G_1 + f(\lambda_2)G_2 + \cdots + f(\lambda_s)G_s, \tag{15.1.11}$$

where the spectral projectors $G_i$ have the following properties.
• $G_i = G_i^2$ is the projector onto the eigenspace $N(A - \lambda_iI)$ along $R(A - \lambda_iI)$.
• $G_1 + G_2 + \cdots + G_s = I$.
• $G_iG_j = 0$ when i ≠ j.
• $G_i = \prod_{j \ne i}(A - \lambda_jI) \Big/ \prod_{j \ne i}(\lambda_i - \lambda_j)$ for i = 1, 2, ..., s.

If $\lambda_i$ happens to be a simple eigenvalue, then

$$G_i = xy^*/y^*x \tag{15.1.12}$$

in which x and $y^*$ are respective right-hand and left-hand eigenvectors associated
with $\lambda_i$.
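The product formula for the spectral projectors lends itself to a direct check. The following Python sketch uses exact fractions; the 2 × 2 stochastic matrix and its eigenvalues 1 and 2/5 are hypothetical inputs (the eigenvalues are supplied by hand, not computed), and the function names are not from the text.

```python
from fractions import Fraction as F

def identity(n):
    return [[F(int(i == j)) for j in range(n)] for i in range(n)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def spectral_projectors(A, eigenvalues):
    """G_i = prod_{j != i} (A - lam_j I) / (lam_i - lam_j) for distinct eigenvalues."""
    n = len(A)
    projectors = []
    for i, lam_i in enumerate(eigenvalues):
        G = identity(n)
        for j, lam_j in enumerate(eigenvalues):
            if j != i:
                factor = [[(A[r][c] - (lam_j if r == c else 0)) / (lam_i - lam_j)
                           for c in range(n)] for r in range(n)]
                G = matmul(G, factor)
        projectors.append(G)
    return projectors

# hypothetical diagonalizable matrix with eigenvalues 1 and 2/5
A = [[F(9, 10), F(1, 10)], [F(1, 2), F(1, 2)]]
eigenvalues = [F(1), F(2, 5)]
G1, G2 = spectral_projectors(A, eigenvalues)

# f(A) for f(z) = z^3, once by multiplication and once by (15.1.11)
A_cubed = matmul(A, matmul(A, A))
from_projectors = [[eigenvalues[0] ** 3 * G1[i][j] + eigenvalues[1] ** 3 * G2[i][j]
                    for j in range(2)] for i in range(2)]

print(G1)
print(A_cubed == from_projectors)
```

In this example every row of $G_1$ equals (5/6, 1/6), the stationary distribution of the chain, which agrees with (15.1.12) when x is the vector of ones and $y^*$ is the left-hand eigenvector for λ = 1.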


Powers of Matrices and Convergence 


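A fundamental issue in analyzing PageRank concerns convergence of powers of matrices.
It follows from (15.1.8) that each power of $A \in \mathbb{C}^{n\times n}$ is given by

$$A^k = PJ^kP^{-1} = P\begin{pmatrix} \ddots & & \\ & J_\star^k & \\ & & \ddots \end{pmatrix}P^{-1}, \ \text{ where } \ J_\star = \begin{pmatrix} \lambda & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda \end{pmatrix},$$

and

$$J_\star^k = \begin{pmatrix}
\lambda^k & \binom{k}{1}\lambda^{k-1} & \binom{k}{2}\lambda^{k-2} & \cdots & \binom{k}{m-1}\lambda^{k-m+1} \\
 & \lambda^k & \binom{k}{1}\lambda^{k-1} & \ddots & \vdots \\
 & & \ddots & \ddots & \binom{k}{2}\lambda^{k-2} \\
 & & & \lambda^k & \binom{k}{1}\lambda^{k-1} \\
 & & & & \lambda^k
\end{pmatrix}_{m\times m}. \tag{15.1.13}$$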


This observation leads to the following limiting properties.

Convergence to Zero and The Neumann Series
For $A \in \mathbb{C}^{n\times n}$, the following statements are equivalent.

• $\rho(A) < 1$. (15.1.14)
• $\lim_{k\to\infty} A^k = 0$. (15.1.15)
• The Neumann series $\sum_{k=0}^{\infty} A^k$ converges to $(I - A)^{-1}$. (15.1.16)
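A quick numerical check of (15.1.16) is sketched below in Python. The 2 × 2 matrix is a made-up example whose largest row sum is 0.75, so ρ(A) < 1 by (15.1.2); the inverse of I − A was worked out by hand, and the function names are hypothetical.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def neumann_partial_sum(A, terms):
    """Return I + A + A^2 + ... + A^(terms - 1)."""
    n = len(A)
    total = [[0.0] * n for _ in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0 = I
    for _ in range(terms):
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        power = matmul(power, A)
    return total

A = [[0.5, 0.25], [0.25, 0.5]]                    # hypothetical, rho(A) = 0.75
exact_inverse = [[8 / 3, 4 / 3], [4 / 3, 8 / 3]]  # (I - A)^{-1}, computed by hand

approx = neumann_partial_sum(A, 200)
max_error = max(abs(approx[i][j] - exact_inverse[i][j])
                for i in range(2) for j in range(2))
print(approx, max_error)
```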


It may be the case that the powers $A^k$ converge, but not to the zero matrix. The
complete story concerning $\lim_{k\to\infty} A^k$ is as follows.


Limits of Powers
For $A \in \mathbb{C}^{n\times n}$, $\lim_{k\to\infty} A^k$ exists if and only if ρ(A) < 1, in which case
$\lim_{k\to\infty} A^k = 0$, or else ρ(A) = 1, with λ = 1 being semisimple and the only
eigenvalue on the unit circle. When it exists,

$$\lim_{k\to\infty} A^k = G = \text{the projector onto } N(I - A) \text{ along } R(I - A). \tag{15.1.17}$$


Averages and Summability 


With each scalar sequence $\{\alpha_1, \alpha_2, \alpha_3, \ldots\}$ there is an associated sequence of averages
$\{\mu_1, \mu_2, \mu_3, \ldots\}$ in which

$$\mu_1 = \alpha_1, \quad \mu_2 = \frac{\alpha_1 + \alpha_2}{2}, \quad \ldots, \quad \mu_n = \frac{\alpha_1 + \alpha_2 + \cdots + \alpha_n}{n}.$$

This sequence of averages is called the Cesàro sequence, and when $\lim_{n\to\infty} \mu_n = \alpha$,
we say that $\{\alpha_n\}$ is Cesàro summable (or merely summable) to α. It can be proven that
if $\{\alpha_n\}$ converges to α, then $\{\mu_n\}$ converges to α, but not conversely. In other words,
convergence implies summability, but summability doesn’t ensure convergence. To see
that a sequence can be summable without being convergent, notice that the oscillatory
sequence {0, 1, 0, 1, ...} doesn’t converge, but it is summable to 1/2, the mean value of
{0, 1}. Averaging has a smoothing effect, so oscillations that prohibit convergence of the
original sequence tend to be smoothed away or averaged out in the Cesàro sequence.


Similar statements hold for sequences of vectors and matrices, but Cesàro summability
is particularly interesting when it is applied to the sequence $\mathcal{P} = \{A^k\}_{k=0}^{\infty}$ of powers
of a square matrix A.

Summability
$A \in \mathbb{C}^{n\times n}$ is Cesàro summable if and only if ρ(A) < 1 or else ρ(A) = 1 with
each eigenvalue on the unit circle being semisimple. When it exists, the limit

$$\lim_{k\to\infty} \frac{I + A + \cdots + A^{k-1}}{k} = G \tag{15.1.18}$$

is the projector onto N(I − A) along R(I − A).


Notice that G ≠ 0 if and only if $1 \in \sigma(A)$, in which case G is the spectral projector
associated with λ = 1. Furthermore, if $\lim_{k\to\infty} A^k = G$, then A is summable to G, but
not conversely.
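The gap between convergence and summability shows up already for a 2 × 2 permutation matrix, whose eigenvalues 1 and −1 both lie on the unit circle and are semisimple. The Python sketch below is an illustration with hypothetical function names; it uses exact fractions so that the averages can be compared exactly.

```python
from fractions import Fraction as F

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def matrix_power(A, k):
    n = len(A)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, A)
    return result

def cesaro_average(A, k):
    """Return (I + A + ... + A^(k-1)) / k."""
    n = len(A)
    total = [[F(0)] * n for _ in range(n)]
    power = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(k):
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
        power = matmul(power, A)
    return [[v / k for v in row] for row in total]

A = [[F(0), F(1)], [F(1), F(0)]]   # swaps the two states at every step

# the powers oscillate between I and A, so they do not converge ...
print(matrix_power(A, 1000), matrix_power(A, 1001))
# ... but the averages settle down to the projector onto N(I - A) along R(I - A)
print(cesaro_average(A, 1000), cesaro_average(A, 1001))
```

Here N(I − A) is spanned by (1, 1) and R(I − A) by (1, −1), so the limiting projector G has every entry equal to 1/2.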


The Power Method 


Google’s original method of choice for computing the PageRank vector was the power
method, which is an iterative technique for computing a dominant eigenpair $(\lambda_1, x)$ of a
diagonalizable matrix $A \in \mathbb{R}^{m\times m}$ with eigenvalues

$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_k|.$$

For the Google matrix, the dominant eigenvalue is $\lambda_1 = 1$, but since the analysis of the
power method is not dependent on this fact, we will allow $\lambda_1$ to be more general. However,
notice that the hypothesis $|\lambda_1| > |\lambda_2|$ implies $\lambda_1$ is real; otherwise $\bar{\lambda}_1$ (the complex
conjugate) is another eigenvalue with the same magnitude as $\lambda_1$. Consider the function
$f(z) = (z/\lambda_1)^n$, and use the spectral representation (15.1.11) along with $|\lambda_i/\lambda_1| < 1$ for
i = 2, 3, ..., k to conclude that

$$\left(\frac{A}{\lambda_1}\right)^n = f(A) = f(\lambda_1)G_1 + f(\lambda_2)G_2 + \cdots + f(\lambda_k)G_k
= G_1 + \left(\frac{\lambda_2}{\lambda_1}\right)^n G_2 + \cdots + \left(\frac{\lambda_k}{\lambda_1}\right)^n G_k \to G_1 \ \text{ as } n \to \infty. \tag{15.1.19}$$


For every $x_0$ we have $(A^nx_0/\lambda_1^n) \to G_1x_0 \in N(A - \lambda_1I)$, so, if $G_1x_0 \ne 0$, then
$A^nx_0/\lambda_1^n$ converges to an eigenvector associated with $\lambda_1$. This means that the direction of
$A^nx_0$ tends toward the direction of an eigenvector because $\lambda_1^n$ acts only as a scaling factor
to keep the length of $A^nx_0$ under control. Rather than using $\lambda_1^n$, we can scale $A^nx_0$ with
something more convenient. For example, $\|A^nx_0\|$ (for any vector norm) is a reasonable
scaling factor, but there are better choices. For vectors v, let m(v) denote the component
of maximal magnitude, and if there is more than one maximal component, let m(v) be the
first maximal component; e.g., m(1, 3, −2) = 3, and m(−3, 3, −2) = −3. The power
method can be summarized as follows.

Power Method
Start with an arbitrary guess $x_0$. (Actually it can’t be completely arbitrary because
you need $x_0 \notin R(A - \lambda_1I)$ to ensure $G_1x_0 \ne 0$, but it’s highly unlikely that a
randomly chosen vector $x_0$ will satisfy $G_1x_0 = 0$.) It can be shown [127, p. 534]
that if we set

$$y_n = Ax_n, \quad \nu_n = m(y_n), \quad x_{n+1} = \frac{y_n}{\nu_n}, \quad \text{for } n = 0, 1, 2, \ldots, \tag{15.1.20}$$

then $x_n \to x$ and $\nu_n \to \lambda_1$, where $Ax = \lambda_1x$.


There are several reasons why the power method might be attractive for computing 
Google’s PageRank vector. 


• Each iteration requires only one matrix-vector product, and this can be exploited to
reduce the computational effort when A is large and sparse (mostly zeros), as is the
case in Google’s application.


• Computations can be done in parallel by simultaneously computing inner products
of rows of A with $x_n$.


• It’s clear from (15.1.19) that, for a diagonalizable matrix, the rate at which (15.1.20)
converges depends on how fast $(\lambda_2/\lambda_1)^n \to 0$. As discussed in section 4.7, Google
can regulate $|\lambda_2|$ through the choice of the Google parameter α, so they can control
the rate of convergence (it’s just assumed that Google’s matrix is diagonalizable).


• Since $\lambda_1 = 1$ for Google’s PageRank problem, there is no need for the scaling factor
$\nu_n$. In other words, the iterations are simply $x_{n+1} = Ax_n$.
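The iteration (15.1.20) with the scaling function m(v) can be sketched in Python as follows. This is a minimal illustration on a hypothetical 2 × 2 matrix with eigenvalues 3 and 1, not Google’s implementation; the function names and the fixed iteration count are assumptions, and no safeguard is included for the case $y_n = 0$.

```python
def m(v):
    """Component of maximal magnitude; the first one if there are ties."""
    return max(v, key=abs)

def power_method(A, x0, iterations=100):
    """Iterate y = A x, nu = m(y), x = y / nu, as in (15.1.20)."""
    x = list(x0)
    nu = None
    for _ in range(iterations):
        y = [sum(a * xi for a, xi in zip(row, x)) for row in A]
        nu = m(y)
        x = [yi / nu for yi in y]
    return nu, x

A = [[2.0, 1.0], [1.0, 2.0]]          # hypothetical matrix, eigenvalues 3 and 1
nu, x = power_method(A, [1.0, 0.0])
print(nu, x)
```

For this matrix the error shrinks roughly by the factor $|\lambda_2/\lambda_1| = 1/3$ per step, in line with (15.1.19).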


Linear Stationary Iterations 


Solving systems of linear equations $A_{n\times n}x = b$ is a frequent necessity for Web search
applications, but the magnitude of n is usually too large for direct solution methods based
on Gaussian elimination to be effective. Consequently, iterative techniques are often the
only choice, and, because of size, sparsity, and memory considerations, the preferred
algorithms are the simpler methods based on matrix-vector products that require no additional
storage beyond that of the original data. Linear stationary iterative methods are the most
common.

Linear Stationary Iterations
Let Ax = b be a linear system that is square but otherwise arbitrary. Writing A
as A = M − N in which $M^{-1}$ exists is called a splitting of A, and the product
$H = M^{-1}N$ is called the associated iteration matrix. For $d = M^{-1}b$ and for an
initial vector x(0), the sequence defined by

$$x(k) = Hx(k-1) + d, \quad k = 1, 2, 3, \ldots \tag{15.1.21}$$

is called a linear stationary iteration. The primary result governing the convergence
of (15.1.21) is the fact that if ρ(H) < 1, then A is nonsingular, and

$$\lim_{k\to\infty} x(k) = x = A^{-1}b \ \text{ (the solution to } Ax = b\text{) for every } x(0). \tag{15.1.22}$$


In theory, the convergence rate of (15.1.21) is governed by the size of ρ(H) along
with the index of its associated eigenvalue (look at (15.1.13)). But for practical work an
indication of how many digits of accuracy can be expected to be gained per iteration is
needed. Suppose that $H_{n\times n}$ is diagonalizable with

$$\sigma(H) = \{\lambda_1, \lambda_2, \ldots, \lambda_s\}, \ \text{ where } \ 1 > |\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_s|$$

(which is frequently the case in applications), and let $\epsilon(k) = x(k) - x$ denote the error
after the $k^{th}$ iteration. Subtracting x = Hx + d (the limiting value in (15.1.21)) from
x(k) = Hx(k − 1) + d produces (for large k)

$$\epsilon(k) = H\epsilon(k-1) = H^k\epsilon(0) = (\lambda_1^kG_1 + \lambda_2^kG_2 + \cdots + \lambda_s^kG_s)\,\epsilon(0) \approx \lambda_1^kG_1\epsilon(0),$$


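where the $G_i$’s are the spectral projectors occurring in the spectral decomposition (15.1.11)
of $H^k$. Similarly, $\epsilon(k-1) \approx \lambda_1^{k-1}G_1\epsilon(0)$, so comparing the $i^{th}$ components of $\epsilon(k-1)$
and $\epsilon(k)$ reveals that after several iterations,

$$\left|\frac{\epsilon_i(k-1)}{\epsilon_i(k)}\right| \approx \frac{1}{|\lambda_1|} = \frac{1}{\rho(H)} \ \text{ for each } i = 1, 2, \ldots, n.$$

To understand the significance of this, suppose for example that

$$|\epsilon_i(k-1)| = 10^{-q} \ \text{ and } \ |\epsilon_i(k)| = 10^{-p} \ \text{ with } p \ge q > 0,$$

so that the error in each entry is reduced by p − q digits per iteration, and we have

$$p - q = \log_{10}\left|\frac{\epsilon_i(k-1)}{\epsilon_i(k)}\right| \approx -\log_{10}\rho(H).$$

Below is a summary.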

Below is a summary. 


Asymptotic Convergence Rate
The number $R = -\log_{10}\rho(H)$, called the asymptotic convergence rate for
(15.1.21), is used to compare different linear stationary iterative algorithms because
it is an indication of the number of digits of accuracy that can be expected
to be eventually gained on each iteration of x(k) = Hx(k − 1) + d.

Each different splitting A = M − N produces a different iterative algorithm, but
there are three particular splittings that have found widespread use.


The Three Classical Splittings


• Jacobi’s method is the result of splitting A = D − N, where D is the diagonal
part of A (assuming each $a_{ii} \ne 0$), and (−N) is the matrix containing the
off-diagonal entries of A. The Jacobi iteration is $x(k) = D^{-1}Nx(k-1) + D^{-1}b$.


• The Gauss–Seidel method is the result of splitting A = (D − L) − U, where D
is the diagonal part of A (assuming each $a_{ii} \ne 0$), and where (−L) and (−U)
contain the entries occurring below and above the diagonal of A, respectively.
The iteration matrix is $H = (D - L)^{-1}U$, and $d = (D - L)^{-1}b$. The Gauss–Seidel
iteration is $x(k) = (D - L)^{-1}Ux(k-1) + (D - L)^{-1}b$.


• The successive overrelaxation (SOR) method incorporates a relaxation parameter
ω ≠ 0 into the Gauss–Seidel method to build a splitting A = M − N,
where $M = \omega^{-1}D - L$ and $N = (\omega^{-1} - 1)D + U$.


It can be shown that Jacobi’s method as well as the Gauss–Seidel method converge
when A is diagonally dominant (i.e., when $|a_{ii}| > \sum_{j \ne i} |a_{ij}|$ for each i = 1, 2, ..., n).
This along with other convergence details can be found in [127].
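The Jacobi and Gauss–Seidel iterations can be written componentwise without ever forming $D^{-1}$ or $(D - L)^{-1}$. The Python sketch below is an illustration on a hypothetical diagonally dominant 3 × 3 system whose exact solution is (1, 1, 1); the function names, the zero starting vector, and the iteration count are assumptions.

```python
def jacobi(A, b, iterations):
    """x(k) = D^{-1} N x(k-1) + D^{-1} b, written one component at a time."""
    n = len(A)
    x = [0.0] * n
    for _ in range(iterations):
        # every component is computed from the previous iterate only
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

def gauss_seidel(A, b, iterations):
    """x(k) = (D - L)^{-1} U x(k-1) + (D - L)^{-1} b, via forward substitution."""
    n = len(A)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            # components 0..i-1 have already been updated in this sweep
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

# hypothetical diagonally dominant system with solution (1, 1, 1)
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]

x_jacobi = jacobi(A, b, 60)
x_gauss_seidel = gauss_seidel(A, b, 60)
print(x_jacobi, x_gauss_seidel)
```

The only difference between the two loops is whether freshly updated components are used within a sweep, which is exactly the difference between the splittings M = D and M = D − L.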


M-matrices 


Because the PageRank vector can be viewed as the solution to a Markov chain, and because
I − P is an M-matrix whenever P is a probability transition matrix, it’s handy to know a
few facts about M-matrices (named in honor of Hermann Minkowski).


M-matrices 


A square (real) matrix A is called an M-matrix whenever there exists a matrix
B ≥ 0 (i.e., $b_{ij} \ge 0$) and a real number r ≥ ρ(B) such that A = rI − B.


If r > ρ(B) in the above definition then A is a nonsingular M-matrix. Below are
some of the important properties of nonsingular M-matrices.


• A is a nonsingular M-matrix if and only if $a_{ij} \le 0$ for all i ≠ j and $A^{-1} \ge 0$.

• If A is a nonsingular M-matrix, then Re(λ) > 0 for all $\lambda \in \sigma(A)$. Conversely, all
matrices with nonpositive off-diagonal entries whose spectrums are in the right-hand
halfplane are nonsingular M-matrices.

• Principal submatrices of nonsingular M-matrices are also nonsingular M-matrices.

• If A is an M-matrix, then all of its principal minors are nonnegative. If A is a
nonsingular M-matrix, then all principal minors are positive.

• All matrices with nonpositive off-diagonal entries whose principal minors are
nonnegative are M-matrices. All matrices with nonpositive off-diagonal entries whose
principal minors are positive are nonsingular M-matrices.

• If A = M − N is a splitting of a nonsingular M-matrix for which $M^{-1} \ge 0$, then
the linear stationary iteration (15.1.21) converges for all initial vectors x(0) and for
all right-hand sides b. In particular, Jacobi’s method converges.
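As a small illustration of the first property, the Python sketch below builds A = I − αP for a made-up 2 × 2 transition matrix P and a hypothetical value α = 17/20, so that A = rI − B with r = 1 and B = αP, and ρ(B) = α < 1. It then checks the sign pattern of A and of $A^{-1}$ with exact fractions. None of the numbers come from the text.

```python
from fractions import Fraction as F

alpha = F(17, 20)                                # hypothetical scaling, 0.85
P = [[F(1, 2), F(1, 2)], [F(1, 5), F(4, 5)]]     # hypothetical transition matrix

# A = I - alpha * P
A = [[(1 if i == j else 0) - alpha * P[i][j] for j in range(2)] for i in range(2)]

# exact inverse of the 2 x 2 matrix A
(a, b), (c, d) = A
det = a * d - b * c
A_inv = [[d / det, -b / det], [-c / det, a / det]]

off_diagonal_nonpositive = all(A[i][j] <= 0 for i in range(2) for j in range(2) if i != j)
inverse_nonnegative = all(v >= 0 for row in A_inv for v in row)

print(A)
print(A_inv)
print(off_diagonal_nonpositive, inverse_nonnegative)
```

Each row of $A^{-1}$ sums to 1/(1 − α) here, because (I − αP) applied to the vector of ones gives (1 − α) times that vector.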


15.2 PERRON-FROBENIUS THEORY 


At a mathematics conference held a few years ago our friend Hans Schneider gave a mem- 
orable presentation titled “Why I Love Perron—Frobenius” in which he made the case that 
the Perron—Frobenius theory of nonnegative matrices is not only among the most elegant 
theories in mathematics, but it is also among the most useful. One might sum up Hans’s 
point by saying that Perron—Frobenius is a testament to the fact that beautiful mathematics 
eventually tends to be useful, and useful mathematics eventually tends to be beautiful. The 
applications involving PageRank, HITS, and other ranking schemes [103] help to under- 
score this principle. 


A matrix A is said to be nonnegative when each entry is a nonnegative number
(denote this by writing A ≥ 0). Similarly, A is a positive matrix when each $a_{ij} > 0$ (write
A > 0). For example, the hyperlink matrix H and the stochastic matrix S (from Chapter
4) that are at the foundation of PageRank are nonnegative matrices, and the Google matrix 
G is a positive matrix. Consequently, properties of positive and nonnegative matrices 
govern the behavior of PageRank, and the Perron—Frobenius theory reveals these properties 
by describing the nature of the dominant eigenvalues and eigenvectors of positive and 
nonnegative matrices. 


Perron 


So much of the mathematics of PageRank, HITS, and associated ideas involves nonnegative
matrices and graphs. This section provides you with the needed ammunition to handle
these concepts. Perron’s 1907 theorem provides the insight for understanding the
eigenstructure of positive matrices. Perron’s theorem for positive matrices is stated below, and
the proof is in [127].

Perron’s Theorem for Positive Matrices
If $A_{n\times n} > 0$ with r = ρ(A), then the following statements are true.
1. r > 0.
2. $r \in \sigma(A)$ (r is called the Perron root).
3. $\text{alg mult}_A(r) = 1$ (the Perron root is simple).
4. There exists an eigenvector x > 0 such that Ax = rx.
5. The Perron vector is the unique vector defined by

$$Ap = rp, \quad p > 0, \quad \|p\|_1 = 1,$$

and, except for positive multiples of p, there are no other nonnegative eigenvectors
for A, regardless of the eigenvalue.

6. r is the only eigenvalue on the spectral circle of A.

7. $r = \max_{x \in \mathcal{N}} f(x)$ (the Collatz–Wielandt formula), where

$$f(x) = \min_{\substack{1 \le i \le n \\ x_i \ne 0}} \frac{[Ax]_i}{x_i} \ \text{ and } \ \mathcal{N} = \{x \mid x \ge 0 \text{ with } x \ne 0\}.$$
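The Collatz–Wielandt formula can be explored numerically. In the Python sketch below, the positive matrix is a made-up example whose Perron root is $(5 + \sqrt{33})/2$; the Perron vector is approximated by a 1-norm-scaled power iteration, and the function names, random seed, and sample size are all assumptions made for illustration.

```python
import random

def collatz_wielandt(A, x):
    """f(x) = min over i with x_i != 0 of [Ax]_i / x_i, for a nonzero x >= 0."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return min(Ax[i] / x[i] for i in range(len(x)) if x[i] != 0)

def perron_pair(A, iterations=200):
    """Approximate the Perron root r and Perron vector p (with ||p||_1 = 1)."""
    n = len(A)
    p = [1.0 / n] * n
    r = 0.0
    for _ in range(iterations):
        Ap = [sum(a * pi for a, pi in zip(row, p)) for row in A]
        r = sum(Ap)              # ||Ap||_1, which tends to r because ||p||_1 = 1
        p = [v / r for v in Ap]
    return r, p

A = [[1.0, 2.0], [3.0, 4.0]]     # hypothetical positive matrix
r, p = perron_pair(A)

random.seed(0)
samples = [[random.random() + 0.01 for _ in range(2)] for _ in range(1000)]
values = [collatz_wielandt(A, x) for x in samples]

print(r, p)
print(max(values), collatz_wielandt(A, p))
```

No sampled vector gives a value of f above r, and the maximum is attained at the Perron vector itself.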


Extensions to Nonnegative Matrices 


Perron’s theorem for positive matrices is a powerful result, so it’s only natural to ask what 
happens when zero entries creep into the picture. Not all is lost if we are willing to be 
flexible. The next theorem (the proof of which is in [127]) says that a portion of Perron’s 
theorem for positive matrices can be extended to nonnegative matrices by sacrificing the 
existence of a positive eigenvector for a nonnegative one. 


Perron’s Theorem for Nonnegative Matrices 
For $A_{n\times n} \ge 0$ with r = ρ(A), the following statements are true.
• $r \in \sigma(A)$ (but r = 0 is possible).
• There exists an eigenvector x ≥ 0 such that Ax = rx.
• The Collatz–Wielandt formula remains valid.


Frobenius 


This is as far as Perron’s theorem can be generalized to nonnegative matrices without
additional hypotheses. For example, $A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ shows that properties 1, 3, and 4 in
Perron’s theorem for positive matrices do not hold for general nonnegative matrices, and
$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ shows that property 6 is also lost. Rather than accepting that the major issues
concerning spectral properties of nonnegative matrices had been settled, F. G. Frobenius
had the insight in 1912 to look below the surface and see that the problem doesn’t stem
just from the existence of zero entries, but rather from the positions of the zero entries. For
example, properties 3 and 4 in Perron’s theorem do not hold for

$$A = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \ \text{ but they are valid for } \ B = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}.$$

Frobenius’s genius was to see that the difference between A and B is in terms of matrix
reducibility (or irreducibility) and to relate these ideas to spectral properties of nonnegative
matrices. The next section introduces these ideas.


Graph and Irreducible Matrices 


A graph is a set of nodes $\{N_1, N_2, \ldots, N_n\}$ and a set of edges $\{E_1, E_2, \ldots, E_k\}$ between
the nodes. A connected graph is one in which there is a sequence of edges linking any pair
of nodes. For example, the graph shown on the left-hand side of Figure 15.1 is undirected
and connected.


A directed graph is a graph containing directed edges. A directed graph is said to be
strongly connected if for each pair of nodes $(N_i, N_k)$ there is a sequence of directed edges
leading from $N_i$ to $N_k$. The graph on the right-hand side of Figure 15.1 is directed but not
strongly connected (e.g., you can’t get from $N_3$ to $N_1$).


Figure 15.1. Left: undirected and connected. Right: directed but not strongly connected.


Each graph defines two useful matrices: an adjacency matrix and an incidence matrix.
For a graph G containing nodes $\{N_1, N_2, \ldots, N_n\}$, the adjacency matrix $L_{n\times n}$ is the
(0, 1)-matrix having

$$l_{ij} = \begin{cases} 1 & \text{if there is an edge from } N_i \text{ to } N_j, \\ 0 & \text{otherwise.} \end{cases}$$

If G is undirected, then its adjacency matrix L is symmetric (i.e., $L = L^T$). For example,
the adjacency matrices for the two graphs shown in Figure 15.1 are

$$L_1 = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}
\quad\text{and}\quad
L_2 = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix},$$

in which rows and columns are indexed by $N_1, N_2, N_3, N_4$.


For an undirected graph G with nodes $\{N_1, N_2, \ldots, N_n\}$ and edges $\{E_1, E_2, \ldots, E_k\}$, the
incidence matrix $C_{n\times k}$ is the (0, 1)-matrix having

$$c_{ij} = \begin{cases} 1 & \text{if node } N_i \text{ touches edge } E_j, \\ 0 & \text{otherwise.} \end{cases}$$

If G is a directed graph, then its incidence matrix is the (0, −1, 1)-matrix having

$$c_{ij} = \begin{cases} 1 & \text{if edge } E_j \text{ is directed toward node } N_i, \\ -1 & \text{if edge } E_j \text{ is directed away from node } N_i, \\ 0 & \text{if edge } E_j \text{ neither begins nor ends at node } N_i. \end{cases}$$


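For example, the incidence matrices for the two graphs shown in Figure 15.1 are

$$C_1 = \begin{pmatrix} 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 \end{pmatrix}
\quad\text{and}\quad
C_2 = \begin{pmatrix} 1 & -1 & 0 & 0 & -1 & 0 \\ -1 & 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & -1 & 0 & -1 \end{pmatrix},$$

in which rows are indexed by $N_1, \ldots, N_4$ and columns by $E_1, \ldots, E_6$.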


There is a direct connection between the connectivity of a directed graph and the 
rank of its incidence matrix. 


Connectivity and Rank 
A directed graph with $n$ nodes and incidence matrix $\mathbf{C}$ is connected if and only if 

$$\operatorname{rank}(\mathbf{C}) = n - 1. \tag{15.2.1}$$


For undirected graphs, arbitrarily assign directions to the edges to make the graph 
directed and apply (15.2.1) [127, p. 203]. 
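
As a rough illustration of (15.2.1), the sketch below builds a directed incidence matrix from an edge list and tests connectivity by computing the rank exactly. The helper names (`incidence_matrix`, `rank`, `is_connected`), the 0-based node labels, and the particular edge ordering are assumptions made for this example, not notation from the text.

```python
from fractions import Fraction

def incidence_matrix(n, edges):
    """Directed incidence matrix: column j has -1 at the tail and +1 at the head of edge j."""
    C = [[0] * len(edges) for _ in range(n)]
    for j, (tail, head) in enumerate(edges):
        C[tail][j] = -1
        C[head][j] = 1
    return C

def rank(M):
    """Rank by exact Gaussian elimination over the rationals."""
    A = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(A[0]) if A else 0):
        pivot = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, len(A)):
            f = A[i][c] / A[r][c]
            A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        r += 1
    return r

def is_connected(n, edges):
    """Apply (15.2.1): connected exactly when rank(C) = n - 1."""
    return rank(incidence_matrix(n, edges)) == n - 1

# Edges (tail, head) read off the adjacency matrix L2, with nodes relabeled 0..3.
edges_l2 = [(1, 0), (0, 3), (1, 2), (3, 1), (0, 2), (3, 2)]
print(is_connected(4, edges_l2))           # True
print(is_connected(4, [(0, 1), (2, 3)]))   # False: the graph has two separate pieces
```

For an undirected graph the same function can be used after assigning an arbitrary direction to each edge, as the boxed statement suggests.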


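Instead of starting with a graph to build a matrix, we can also do it in reverse, i.e., 
start with a matrix and build a graph. Given a matrix $\mathbf{A}_{n\times n}$, the graph of $\mathbf{A}$ is defined to be 
the directed graph $\mathcal{G}(\mathbf{A})$ on a set of nodes $\{N_1, N_2, \ldots, N_n\}$ in which there is a directed 
edge leading from $N_i$ to $N_j$ if and only if $a_{ij} \neq 0$. For example, if $\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 0 & 3 \end{pmatrix}$, then the 
graph $\mathcal{G}(\mathbf{A})$ looks like this: 

[Diagram: the graph $\mathcal{G}(\mathbf{A})$ on nodes 1 and 2.] 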

Any product of the form $\mathbf{P}^T\mathbf{A}\mathbf{P}$ in which $\mathbf{P}$ is a permutation matrix (a matrix obtained 
from the identity matrix $\mathbf{I}$ by permuting its rows or columns) is called a symmetric 
permutation of $\mathbf{A}$. The effect of a symmetric permutation on a matrix is to interchange rows 
in the same way as columns are interchanged. The effect of a symmetric permutation on 
the graph of a matrix is to relabel the nodes. Consequently, the directed graph of a matrix 
is invariant under a symmetric permutation. In other words, $\mathcal{G}(\mathbf{P}^T\mathbf{A}\mathbf{P}) = \mathcal{G}(\mathbf{A})$ whenever 
$\mathbf{P}$ is a permutation matrix. For example, if $\mathbf{P}$ is the permutation matrix $\mathbf{P} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, and 
if we again use $\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 0 & 3 \end{pmatrix}$, then 

$$\mathbf{P}^T\mathbf{A}\mathbf{P} = \begin{pmatrix} 3 & 0 \\ 2 & 1 \end{pmatrix}, \tag{15.2.2}$$

and the graph $\mathcal{G}(\mathbf{P}^T\mathbf{A}\mathbf{P})$ looks like this: 

[Diagram: the same graph with the node labels 1 and 2 interchanged.] 


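Matrix $\mathbf{A}_{n\times n}$ is said to be a reducible matrix when there exists a permutation matrix 
$\mathbf{P}$ such that 

$$\mathbf{P}^T\mathbf{A}\mathbf{P} = \begin{pmatrix} \mathbf{X} & \mathbf{Y} \\ \mathbf{0} & \mathbf{Z} \end{pmatrix}, \quad\text{where } \mathbf{X} \text{ and } \mathbf{Z} \text{ are both square.} \tag{15.2.3}$$
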
For example, the matrix A in (15.2.2) is clearly reducible. Naturally, an irreducible matrix 
is a matrix that is not reducible. 


As the following theorem shows, the concepts of matrix irreducibility (or reducibil- 
ity) and strong connectivity (or lack thereof) are intimately related. 


Irreducibility and Connectivity 
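A square matrix $\mathbf{A}$ is irreducible if and only if its directed graph is strongly connected. 
In other words, $\mathbf{A}$ is irreducible if and only if for each pair of indices $(i, j)$ 
there is a sequence of entries in $\mathbf{A}$ such that $a_{ik_1} a_{k_1k_2} \cdots a_{k_tj} \neq 0$. Equivalently, 
$\mathbf{A}$ is irreducible if for all permutation matrices $\mathbf{P}$, 

$$\mathbf{P}^T\mathbf{A}\mathbf{P} \neq \begin{pmatrix} \mathbf{X} & \mathbf{Y} \\ \mathbf{0} & \mathbf{Z} \end{pmatrix}, \quad\text{where } \mathbf{X} \text{ and } \mathbf{Z} \text{ are square.}$$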


For example, can you determine if 

$$\mathbf{A} = \begin{pmatrix} 0 & 1 & 2 & 0 & 0 \\ 0 & 0 & 0 & 7 & 0 \\ 2 & 0 & 0 & 0 & 0 \\ 0 & 9 & 2 & 0 & 4 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

is reducible or irreducible? It would be a mistake to try to use the definition, because 
deciding whether there exists a permutation matrix $\mathbf{P}$ such that (15.2.3) holds by 
sorting through all $5 \times 5$ permutation matrices (there are $5! = 120$ of them) is tedious. However, the above theorem 
makes the question easy. Examining $\mathcal{G}(\mathbf{A})$ reveals that it is strongly connected (every node 
is accessible by some sequence of directed edges from every other node), so $\mathbf{A}$ must be irreducible. 
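
The graph test can be automated. The following sketch checks strong connectivity of $\mathcal{G}(\mathbf{A})$ with two depth-first searches (one on the graph, one on the graph with all edges reversed) and applies it to the matrix above. The function names `reachable` and `is_irreducible` are illustrative assumptions, and the second test matrix is simply a small upper-triangular example chosen for contrast.

```python
def reachable(A, start):
    """Indices reachable from `start` along edges i -> j where A[i][j] != 0."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j, a in enumerate(A[i]):
            if a != 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def is_irreducible(A):
    """True when the directed graph of the square matrix A is strongly connected."""
    n = len(A)
    transpose = [list(col) for col in zip(*A)]
    # Strongly connected <=> node 0 reaches every node and every node reaches node 0.
    return len(reachable(A, 0)) == n and len(reachable(transpose, 0)) == n

A = [[0, 1, 2, 0, 0],
     [0, 0, 0, 7, 0],
     [2, 0, 0, 0, 0],
     [0, 9, 2, 0, 4],
     [0, 0, 0, 1, 0]]
print(is_irreducible(A))                  # True
print(is_irreducible([[1, 2], [0, 3]]))   # False: already block upper triangular
```

Each search touches every entry of the matrix at most once, so the check costs on the order of $n^2$ operations rather than a scan over $n!$ permutations.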
The Perron–Frobenius Theorem 


Frobenius’s contribution was to realize that while properties 1, 3, 4, and 6 in Perron’s theo- 
rem for positive matrices can be lost when zeros creep into the picture (i.e., for nonnegative 
matrices), the trouble is not simply the existence of zero entries, but rather the problem is 
the location of the zero entries. In other words, Frobenius realized that the lost properties 
1, 3, and 4 are in fact not lost when the zeros are in just the right locations—namely the 
locations that ensure that the matrix is irreducible. Unfortunately irreducibility alone still 
does not save property 6—it remains lost (more about this issue later). 


Below is the formal statement of the Perron—Frobenius theorem—the details con- 
cerning the proof can be found in [127]. 


Perron—Frobenius Theorem 
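If $\mathbf{A}_{n\times n} \geq \mathbf{0}$ is irreducible, then each of the following is true. 

1. $r = \rho(\mathbf{A}) > 0$. 

2. $r \in \sigma(\mathbf{A})$ ($r$ is the Perron root). 

3. $\operatorname{alg\,mult}_{\mathbf{A}}(r) = 1$ (the Perron root is simple). 

4. There exists an eigenvector $\mathbf{x} > \mathbf{0}$ such that $\mathbf{A}\mathbf{x} = r\mathbf{x}$. 

5. The Perron vector is the unique vector defined by 

$$\mathbf{A}\mathbf{p} = r\mathbf{p}, \quad \mathbf{p} > \mathbf{0}, \quad \|\mathbf{p}\|_1 = 1,$$

and, except for positive multiples of $\mathbf{p}$, there are no other nonnegative eigenvectors 
for $\mathbf{A}$, regardless of the eigenvalue. 

6. $r$ need not be the only eigenvalue on the spectral circle of $\mathbf{A}$. 

7. $r = \max_{\mathbf{x} \in \mathcal{N}} f(\mathbf{x})$ (the Collatz–Wielandt formula), 

$$\text{where } f(\mathbf{x}) = \min_{\substack{1 \le i \le n \\ x_i \neq 0}} \frac{[\mathbf{A}\mathbf{x}]_i}{x_i} \quad\text{and}\quad \mathcal{N} = \{\mathbf{x} \mid \mathbf{x} \geq \mathbf{0} \text{ with } \mathbf{x} \neq \mathbf{0}\}.$$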


Primitive Matrices 


The only property in Perron’s theorem for positive matrices on page 168 that irreducibility 
is not able to salvage is the sixth property, which states that there is only one eigenvalue on 
the spectral circle. Indeed, $\mathbf{A} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ is nonnegative and irreducible, but the eigenvalues 
$\pm 1$ are both on the unit circle. The property of having (or not having) only one eigenvalue 
on the spectral circle divides the set of nonnegative irreducible matrices into two important 
classes. 

Primitive Matrices 


• A matrix $\mathbf{A}$ is defined to be a primitive matrix when $\mathbf{A}$ is a nonnegative irreducible 
matrix that has only one eigenvalue, $r = \rho(\mathbf{A})$, on its spectral circle. 

• A nonnegative irreducible matrix having $h > 1$ eigenvalues on its spectral circle 
is said to be imprimitive, and $h$ is called the index of imprimitivity. 

• If $\mathbf{A}$ is imprimitive, then the $h$ eigenvalues on the spectral circle are 
$\{r, r\omega, r\omega^2, \ldots, r\omega^{h-1}\}$, where $\omega = e^{2\pi i/h}$. 
In other words, they are $r = \rho(\mathbf{A})$ times the $h$th roots of unity, and they are uniformly 
spaced around the circle. Furthermore, each eigenvalue $r\omega^k$ on the spectral circle 
is simple. 


So what’s the big deal about having only one eigenvalue on the spectral circle? Well, 
primitivity is important because it’s precisely what determines whether or not the powers 
of a normalized nonnegative irreducible matrix will have a limiting value, and this is the 
fundamental issue concerning the existence of the PageRank vector. The precise wording 
of the theorem is as follows. 


Limits and Primitivity 
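A nonnegative irreducible matrix $\mathbf{A}$ with $r = \rho(\mathbf{A})$ is primitive if and only if 
$\lim_{k\to\infty} (\mathbf{A}/r)^k$ exists, in which case 

$$\lim_{k\to\infty} \left(\frac{\mathbf{A}}{r}\right)^{k} = \frac{\mathbf{p}\mathbf{q}^T}{\mathbf{q}^T\mathbf{p}}, \tag{15.2.4}$$

where $\mathbf{p}$ and $\mathbf{q}^T$ are the respective right-hand and left-hand Perron vectors for $\mathbf{A}$. 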
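
A small numerical check may make (15.2.4) more concrete. The sketch below uses a $2 \times 2$ positive matrix whose Perron root and Perron vectors are easy to find by hand, and compares a high power of $\mathbf{A}/r$ with $\mathbf{p}\mathbf{q}^T/(\mathbf{q}^T\mathbf{p})$. The matrices, the exponent 40, and the helper names are assumptions chosen only for illustration.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def scaled_power(A, r, k):
    """Return (A / r)^k by repeated multiplication."""
    n = len(A)
    B = [[a / r for a in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, B)
    return result

A = [[1, 3], [2, 2]]   # positive, hence primitive; eigenvalues are 4 and -1, so r = 4
p = [0.5, 0.5]         # right-hand Perron vector: A p = 4 p, entries sum to 1
q = [0.4, 0.6]         # left-hand Perron vector: q^T A = 4 q^T, entries sum to 1
qTp = sum(qi * pi for qi, pi in zip(q, p))
limit = [[pi * qj / qTp for qj in q] for pi in p]

print(scaled_power(A, 4, 40))   # numerically indistinguishable from `limit`
print(limit)                    # [[0.4, 0.6], [0.4, 0.6]]

# An imprimitive matrix: the powers alternate forever, so no limit exists.
S = [[0, 1], [1, 0]]
print(scaled_power(S, 1, 7), scaled_power(S, 1, 8))
```

The second eigenvalue of $\mathbf{A}/4$ is $-1/4$, so the error after $k$ steps shrinks like $(1/4)^k$; for the swap matrix $\mathbf{S}$ the second eigenvalue is $-1$, which is exactly why the powers keep oscillating.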


If $\mathbf{A}_{n\times n} \geq \mathbf{0}$ is irreducible but imprimitive so that there are $h > 1$ eigenvalues 
on the spectral circle, then it can be demonstrated [127] that each of these eigenvalues is 
simple and that they are distributed uniformly on the spectral circle in the sense that they 
are $r = \rho(\mathbf{A})$ times the $h$th roots of unity, i.e., the eigenvalues on the spectral circle are given by 
$\{r, r\omega, r\omega^2, \ldots, r\omega^{h-1}\}$, where $\omega = e^{2\pi i/h}$. 


Given a nonnegative matrix, do we really have to compute the eigenvalues and count 
how many fall on the spectral circle to check for primitivity? No! There are simpler tests. 




Tests for Primitivity 
For a square nonnegative matrix A, each of the following is true. 


• $\mathbf{A}$ is primitive if $\mathbf{A}$ is irreducible and has at least one positive diagonal element. 

• $\mathbf{A}$ is primitive if and only if $\mathbf{A}^m > \mathbf{0}$ for some $m > 0$. 


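The first test above only provides a sufficient condition for primitivity, while the second 
condition is both necessary and sufficient: the first test is cheaper but not conclusive, 
while the second is more expensive but absolutely conclusive. For example, to determine 
whether or not the irreducible matrix $\mathbf{A} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 2 \\ 3 & 4 & 0 \end{pmatrix}$ is primitive, the first test doesn’t 
apply because the diagonal of $\mathbf{A}$ is entirely zeros, so we are forced to apply the second test 
by computing powers of $\mathbf{A}$. But the job is simplified by noticing that if $\mathbf{B}$ is the Boolean 
matrix defined by 

$$b_{ij} = \begin{cases} 1 & \text{if } a_{ij} > 0, \\ 0 & \text{if } a_{ij} = 0, \end{cases}$$

then $[\mathbf{B}^k]_{ij} > 0$ if and only if $[\mathbf{A}^k]_{ij} > 0$ for every $k > 0$. Therefore, we only need 
to compute powers of $\mathbf{B}$ (it can be shown that no more than $n^2 - 2n + 2$ powers are 
required), and these powers require only Boolean operations AND and OR. The matrix 
$\mathbf{A}$ in this example is primitive because the powers of $\mathbf{B}$ are 

$$\mathbf{B} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},\ \mathbf{B}^2 = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix},\ \mathbf{B}^3 = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix},\ \mathbf{B}^4 = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix},\ \mathbf{B}^5 = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}.$$
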
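
The Boolean-power test is short enough to sketch directly. The code below forms $\mathbf{B}$ from $\mathbf{A}$, multiplies with AND/OR only, and stops at the bound $n^2 - 2n + 2$ quoted above. The function names `boolean_product` and `primitivity_exponent` are assumptions made for this illustration.

```python
def boolean_product(X, Y):
    """Boolean matrix product: entry (i, j) is OR over k of (X[i][k] AND Y[k][j])."""
    n = len(X)
    return [[any(X[i][k] and Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def primitivity_exponent(A):
    """Smallest m with A^m > 0, or None if no power up to n^2 - 2n + 2 is positive."""
    n = len(A)
    B = [[a > 0 for a in row] for row in A]
    power = B                                   # power holds B^m at the top of the loop
    for m in range(1, n * n - 2 * n + 3):
        if all(all(row) for row in power):
            return m
        power = boolean_product(power, B)
    return None

A = [[0, 1, 0],
     [0, 0, 2],
     [3, 4, 0]]
print(primitivity_exponent(A))                  # 5, matching the powers listed above
print(primitivity_exponent([[0, 1], [1, 0]]))   # None: this matrix is imprimitive
```

For this $3 \times 3$ example the first positive power is the fifth, which happens to equal the bound $n^2 - 2n + 2 = 5$.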
While we might prefer our matrices to be primitive, Mother Nature doesn’t always 
cooperate. Mathematical models of physical phenomena that involve oscillations gener- 
ally produce imprimitive matrices, where the number of eigenvalues on the spectral circle 
(the index of imprimitivity) corresponds to the period of oscillation. Consequently, it’s 
worthwhile to have a grasp on the index of imprimitivity. While the powers of an irreducible 
matrix $\mathbf{A} \geq \mathbf{0}$ can tell us if $\mathbf{A}$ has more than one eigenvalue on its spectral circle, 
the powers of $\mathbf{A}$ provide no clue to the number of such eigenvalues. The issue is more 
complicated, and the following theorem is the primary theoretical aid in determining the 
index of imprimitivity short of actually computing all eigenvalues. 


Index of Imprimitivity 
If $c(x) = x^n + c_{k_1}x^{n-k_1} + c_{k_2}x^{n-k_2} + \cdots + c_{k_s}x^{n-k_s} = 0$ is the characteristic 
equation of an imprimitive matrix $\mathbf{A}_{n\times n}$ in which only the terms with nonzero 
coefficients are listed (i.e., each $c_{k_j} \neq 0$, and $n > (n-k_1) > \cdots > (n-k_s)$), then 
the index of imprimitivity $h$ is the greatest common divisor of $\{k_1, k_2, \ldots, k_s\}$. 
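
One possible way to apply this result is sketched below: the characteristic polynomial is computed exactly with the Faddeev–LeVerrier recurrence (a technique swapped in here for convenience, not one taken from the text), and $h$ is the gcd of the indices $k$ with nonzero coefficient. The function names and test matrices are assumptions, and the input is assumed to be a nonnegative irreducible matrix with integer or rational entries.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def char_poly_coefficients(A):
    """Coefficients [1, c_1, ..., c_n] of x^n + c_1 x^(n-1) + ... + c_n = det(xI - A),
    computed with the Faddeev-LeVerrier recurrence in exact arithmetic."""
    n = len(A)
    A = [[Fraction(a) for a in row] for row in A]
    M = [[Fraction(0)] * n for _ in range(n)]
    coeffs = [Fraction(1)]
    for k in range(1, n + 1):
        # M_k = A M_{k-1} + c_{k-1} I, then c_k = -trace(A M_k) / k
        AM = [[sum(A[i][t] * M[t][j] for t in range(n)) for j in range(n)] for i in range(n)]
        M = [[AM[i][j] + (coeffs[-1] if i == j else 0) for j in range(n)] for i in range(n)]
        trace = sum(A[i][t] * M[t][i] for i in range(n) for t in range(n))
        coeffs.append(-trace / k)
    return coeffs

def index_of_imprimitivity(A):
    """gcd of the indices k whose coefficient c_k is nonzero."""
    coeffs = char_poly_coefficients(A)
    ks = [k for k in range(1, len(coeffs)) if coeffs[k] != 0]
    return reduce(gcd, ks)

cycle = [[0, 2, 0],
         [0, 0, 3],
         [4, 0, 0]]                               # characteristic equation x^3 - 24 = 0
print(index_of_imprimitivity(cycle))              # 3
print(index_of_imprimitivity([[0, 1], [1, 0]]))   # 2, from x^2 - 1 = 0
```

The weighted 3-cycle has only the $k = 3$ term, so its three eigenvalues of modulus $24^{1/3}$ sit at equal spacing on the spectral circle.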


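Finally, it is often useful to decompose an imprimitive matrix, and the Frobenius 
form is the standard way of doing so. 

Frobenius Form 

For each imprimitive matrix $\mathbf{A}$ with index of imprimitivity $h > 1$, there exists a 
permutation matrix $\mathbf{P}$ such that 

$$\mathbf{P}^T\mathbf{A}\mathbf{P} = \begin{pmatrix} \mathbf{0} & \mathbf{A}_{12} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{A}_{23} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{A}_{h-1,h} \\ \mathbf{A}_{h1} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} \end{pmatrix}, \tag{15.2.5}$$

where the zero blocks on the main diagonal are square. 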


15.3 MARKOV CHAINS 


The mathematical component of Google’s PageRank vector is the stationary distribution of 
a discrete-time, finite-state Markov chain. So, to understand and analyze the mathematics 
of PageRank, it’s necessary to have an appreciation of Markov chain concepts, and that’s 
the purpose of this section. Let’s begin with some definitions. 


• A stochastic matrix is a nonnegative matrix $\mathbf{P}_{n\times n}$ in which each row sum is equal 
to 1. Some authors say “row-stochastic” to distinguish this from the case when each 
column sum is 1. 

• A stochastic process is a set of random variables $\{X_t\}_{t=0}^{\infty}$ having a common range 
$\{S_1, S_2, \ldots, S_n\}$, which is called the state space for the process. Parameter $t$ is 
generally thought of as time, and $X_t$ represents the state of the process at time $t$. For 
example, consider the process of surfing the Web by successively clicking on links 
to move from one Web page to another. The state space is the set of all Web pages, 
and the random variable $X_t$ is the Web page being viewed at time $t$. 

  – To emphasize that time is considered discretely rather than continuously, the 
phrase “discrete-time process” is often used, and the phrase “finite-state process” 
can be used to emphasize that the state space is finite rather than infinite. 
Our discussion is limited to discrete-time finite-state processes. 


• A Markov chain is a stochastic process that satisfies the Markov property 

$$P(X_{t+1} = S_j \mid X_t = S_{i_t}, X_{t-1} = S_{i_{t-1}}, \ldots, X_0 = S_{i_0}) = P(X_{t+1} = S_j \mid X_t = S_{i_t})$$

for each $t = 0, 1, 2, \ldots$. The notation $P(E \mid F)$ denotes the conditional probability 
that event $E$ occurs given event $F$ occurs. A review of some elementary probability is 
in order if this is not already a familiar concept. 

  – The Markov property asserts that the process is memoryless in the sense that 
the state of the chain at the next time period depends only on the current state 
and not on the past history of the chain. For example, the process of surfing the 
Web is a Markov chain provided that the next page that the Web surfer visits 
doesn’t depend on the pages that were visited in the past; the choice depends 
only on the current page. In other words, if the surfer randomly selects a link 
on the current page in order to get to the next Web page, then the process is a 
Markov chain. This kind of chain is referred to as a random walk on the link 
structure of the Web. 


• The transition probability $p_{ij}(t) = P(X_t = S_j \mid X_{t-1} = S_i)$ is the probability of 
being in state $S_j$ at time $t$ given that the chain is in state $S_i$ at time $t - 1$, so think of 
this simply as the probability of moving from $S_i$ to $S_j$ at time $t$. 

• The transition probability matrix $\mathbf{P}_{n\times n}(t) = [p_{ij}(t)]$ is clearly a nonnegative matrix, 
and a little thought should convince you that each row sum must be 1. In other 
words, $\mathbf{P}(t)$ is a stochastic matrix for each $t$. 

• A stationary Markov chain is a chain in which the transition probabilities do not 
vary with time, i.e., $p_{ij}(t) = p_{ij}$ for all $t$. Stationary chains are also known as 
homogeneous chains. 

  – In this case the transition probability matrix is a constant stochastic matrix 
$\mathbf{P} = [p_{ij}]$. Stationarity is assumed in the sequel. 

  – In such a way, every Markov chain defines a stochastic matrix, but the converse 
is also true: every stochastic matrix $\mathbf{P}_{n\times n}$ defines an $n$-state Markov 
chain because the entries $p_{ij}$ define a set of transition probabilities that can be 
interpreted as a stationary Markov chain on $n$ states. 


• An irreducible Markov chain is a chain for which the transition probability matrix 
$\mathbf{P}$ is an irreducible matrix. A chain is said to be reducible when $\mathbf{P}$ is a reducible 
matrix. 

  – A periodic Markov chain is an irreducible chain whose transition probability 
matrix $\mathbf{P}$ is an imprimitive matrix. These chains are called periodic because 
each state can be occupied only at periodic points in time, where the period is 
the index of imprimitivity. For example, consider an irreducible chain whose 
index of imprimitivity is $h = 3$. The Frobenius form (15.2.5) means that the 
states can be reordered (relabeled) to create three clusters of states for which the 
transition matrix and its powers have the form 

$$\mathbf{P} = \begin{pmatrix} 0 & \star & 0 \\ 0 & 0 & \star \\ \star & 0 & 0 \end{pmatrix},\quad \mathbf{P}^2 = \begin{pmatrix} 0 & 0 & \star \\ \star & 0 & 0 \\ 0 & \star & 0 \end{pmatrix},\quad \mathbf{P}^3 = \begin{pmatrix} \star & 0 & 0 \\ 0 & \star & 0 \\ 0 & 0 & \star \end{pmatrix},\quad \mathbf{P}^4 = \begin{pmatrix} 0 & \star & 0 \\ 0 & 0 & \star \\ \star & 0 & 0 \end{pmatrix},\ \ldots,$$

where this pattern continues indefinitely. If the chain begins in a state in cluster 
$i$, then this periodic pattern ensures that the chain can occupy a state in cluster 
$j$ only at the end of every third step (see transient properties on page 179). 

• An aperiodic Markov chain is an irreducible chain whose transition probability 
matrix $\mathbf{P}$ is a primitive matrix. 


• A probability distribution vector (or “probability vector” for short) is defined to be 
a nonnegative row vector $\mathbf{p}^T = (p_1, p_2, \ldots, p_n)$ such that $\sum_k p_k = 1$. (Every row 
in a stochastic matrix is a probability vector.) 

• A stationary probability distribution vector for a Markov chain whose transition 
probability matrix is $\mathbf{P}$ is a probability vector $\boldsymbol{\pi}^T$ such that $\boldsymbol{\pi}^T\mathbf{P} = \boldsymbol{\pi}^T$. 

• The $k$th step probability distribution vector for an $n$-state chain is defined to be 

$$\mathbf{p}^T(k) = (p_1(k), p_2(k), \ldots, p_n(k)), \quad\text{where } p_j(k) = P(X_k = S_j).$$

In other words, $p_j(k)$ is the probability of being in the $j$th state after the $k$th step, 
but before the $(k+1)$st step. 

• The initial distribution vector is 

$$\mathbf{p}^T(0) = (p_1(0), p_2(0), \ldots, p_n(0)), \quad\text{where } p_j(0) = P(X_0 = S_j).$$

In other words, $p_j(0)$ is the probability that the chain starts in $S_j$. 


To illustrate these concepts, consider the tiny three-page web shown in Figure 15.2, 

[Figure 15.2] 

where the arrows indicate links, e.g., page 2 contains two links to page 3, and vice versa. 
The Markov chain defined by a random walk on this link structure evolves as a Web surfer 
clicks on a randomly selected link on the page currently being viewed, and the transition 
probability matrix for this chain is the irreducible stochastic matrix 

$$\mathbf{H} = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/3 & 0 & 2/3 \\ 1/3 & 2/3 & 0 \end{pmatrix}.$$

In this example $\mathbf{H}$ (the hyperlink matrix) is stochastic, but if there had been a dangling 
node (a page containing no links to click on), then $\mathbf{H}$ would have a zero row, in which case 
$\mathbf{H}$ would not be stochastic and the process would not be a Markov chain.¹ 


If our Web surfer starts on page 2 in Figure 15.2, then the initial distribution vector 
for the chain is $\mathbf{p}^T(0) = (0, 1, 0) = \mathbf{e}_2^T$. But if the surfer simply selects an initial page 
at random, then $\mathbf{p}^T(0) = (1/3, 1/3, 1/3) = \mathbf{e}^T/3$ is the uniform distribution vector. A 
standard eigenvalue calculation reveals that $\sigma(\mathbf{H}) = \{1, -1/3, -2/3\}$, so it’s apparent 
that $\mathbf{H}$ is a nonnegative matrix having spectral radius $\rho(\mathbf{H}) = 1$. 
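
The eigenvalue claim can be verified without floating point. The sketch below stores $\mathbf{H}$ with exact fractions and evaluates $\det(\mathbf{H} - \lambda\mathbf{I})$ at each of the three stated eigenvalues. The helper names `det3` and `char_value` are assumptions introduced for this check.

```python
from fractions import Fraction as F

H = [[F(0), F(1, 2), F(1, 2)],
     [F(1, 3), F(0), F(2, 3)],
     [F(1, 3), F(2, 3), F(0)]]

def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def char_value(M, lam):
    """det(M - lam * I) for a 3 x 3 matrix M."""
    shifted = [[M[r][c] - (lam if r == c else 0) for c in range(3)] for r in range(3)]
    return det3(shifted)

print([sum(row) for row in H])                                      # every row sums to 1
print([char_value(H, lam) for lam in (F(1), F(-1, 3), F(-2, 3))])   # all three values are 0
```

Since a $3 \times 3$ matrix has exactly three eigenvalues, these three roots account for the whole spectrum, and the largest modulus is 1.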


The fact that $\rho(\mathbf{H}) = 1$ is a feature of all stochastic matrices $\mathbf{P}_{n\times n}$ because having 
row sums equal to 1 means that $\|\mathbf{P}\|_\infty = 1$ or, equivalently, $\mathbf{P}\mathbf{e} = \mathbf{e}$, where $\mathbf{e}$ is the 
column of all 1’s. Because $(1, \mathbf{e})$ is an eigenpair for every stochastic matrix, and because 
$\rho(\star) \leq \|\star\|$ for every matrix norm, it follows that 

$$1 \leq \rho(\mathbf{P}) \leq \|\mathbf{P}\|_\infty = 1 \implies \rho(\mathbf{P}) = 1. \tag{15.3.1}$$

Furthermore, $\mathbf{e}$ is a positive eigenvector associated with $\rho(\mathbf{P}) = 1$. But be careful! This 
doesn’t mean that you necessarily can call $\mathbf{e}$ the Perron vector for $\mathbf{P}$ because $\mathbf{P}$ might not 
be irreducible,² e.g., consider $\mathbf{P} = \begin{pmatrix} .5 & .5 \\ 0 & 1 \end{pmatrix}$. 

¹As explained earlier, this is why Google alters the raw hyperlink matrix before computing PageRank. 


Almost all Markovian analysis revolves around questions concerning the transient 
behavior of the chain as well as the limiting behavior, and standard goals are as follows. 
• Describe the $k$th step distribution $\mathbf{p}^T(k)$ for any initial distribution vector $\mathbf{p}^T(0)$. 

• Determine if $\lim_{k\to\infty} \mathbf{p}^T(k)$ exists, and, if so, find the value of $\lim_{k\to\infty} \mathbf{p}^T(k)$. 

• When $\lim_{k\to\infty} \mathbf{p}^T(k)$ doesn’t exist, determine if the Cesaro limit 

$$\lim_{k\to\infty} \frac{\mathbf{p}^T(0) + \mathbf{p}^T(1) + \cdots + \mathbf{p}^T(k-1)}{k}$$

exists, and, if so, find its value and interpret its meaning. 


Transient Behavior 


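Given an initial distribution vector $\mathbf{p}^T(0) = (p_1(0), p_2(0), \ldots, p_n(0))$, the first aim is to 
calculate the probability of being in any given state after the first transition (but before 
the second), i.e., determine $\mathbf{p}^T(1) = (p_1(1), p_2(1), \ldots, p_n(1))$. Let $\wedge$ and $\vee$ respectively 
denote AND and OR. It follows from elementary probability theory that for each $j$, 

$$\begin{aligned} p_j(1) = P(X_1 = S_j) &= P\big[(X_1 = S_j \wedge X_0 = S_1) \vee (X_1 = S_j \wedge X_0 = S_2) \vee \cdots \vee (X_1 = S_j \wedge X_0 = S_n)\big] \\ &= \sum_{i=1}^{n} P\big[X_1 = S_j \wedge X_0 = S_i\big] = \sum_{i=1}^{n} P\big[X_0 = S_i\big]\, P\big[X_1 = S_j \mid X_0 = S_i\big] \\ &= \sum_{i=1}^{n} p_i(0)\, p_{ij}. \end{aligned}$$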


In other words, $\mathbf{p}^T(1) = \mathbf{p}^T(0)\mathbf{P}$, which describes the evolution from the initial distribution 
to the distribution after one step. The “no memory” Markov property provides the 
state of affairs at the end of two steps: it says to simply start over but with $\mathbf{p}^T(1)$ as the 
initial distribution. Consequently, $\mathbf{p}^T(2) = \mathbf{p}^T(1)\mathbf{P}$, and $\mathbf{p}^T(3) = \mathbf{p}^T(2)\mathbf{P}$, etc., and 
successive substitution yields 

$$\mathbf{p}^T(k) = \mathbf{p}^T(0)\mathbf{P}^k, \tag{15.3.2}$$

which is simply a special case of the power method (15.1.20) except that left-hand 
vector-matrix multiplication is used. Furthermore, if $\mathbf{P}^k = [p_{ij}^{(k)}]$, then setting $\mathbf{p}^T(0) = \mathbf{e}_i^T$ in 
(15.3.2) yields $p_j(k) = p_{ij}^{(k)}$ for each $i = 1, 2, \ldots, n$. Below is a summary. 

²The need to force irreducibility is another reason why Google modifies the raw hyperlink matrix. 

Transient Properties 
If $\mathbf{P}$ is the transition probability matrix for a Markov chain on states 
$\{S_1, S_2, \ldots, S_n\}$, then each of the following is true. 

• The matrix $\mathbf{P}^k$ represents the $k$-step transition probability matrix in the sense 
that its $(i, j)$-entry $[\mathbf{P}^k]_{ij} = p_{ij}^{(k)}$ is the probability of moving from $S_i$ to $S_j$ in 
exactly $k$ steps. 

• The $k$th step distribution vector is given by $\mathbf{p}^T(k) = \mathbf{p}^T(0)\mathbf{P}^k$. 
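
To see (15.3.2) at work, the sketch below pushes an initial distribution through the three-page hyperlink matrix $\mathbf{H}$ of Figure 15.2 using row-vector times matrix products. Exact fractions keep the arithmetic transparent; the function names `step` and `distribution_after` are assumptions made for this example.

```python
from fractions import Fraction as F

H = [[F(0), F(1, 2), F(1, 2)],
     [F(1, 3), F(0), F(2, 3)],
     [F(1, 3), F(2, 3), F(0)]]

def step(p, P):
    """One transition: the row vector p^T P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def distribution_after(p0, P, k):
    """k-th step distribution p^T(k) = p^T(0) P^k, built one step at a time."""
    p = list(p0)
    for _ in range(k):
        p = step(p, P)
    return p

start_on_page_2 = [0, 1, 0]
print(distribution_after(start_on_page_2, H, 1))   # (1/3, 0, 2/3): the second row of H
print(distribution_after(start_on_page_2, H, 2))   # (2/9, 11/18, 1/6)
```

Starting from $\mathbf{e}_2^T$ simply reads off the second row of $\mathbf{H}^k$, which is the statement $p_j(k) = p_{2j}^{(k)}$ from the summary above.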


Limiting Behavior 


Analyzing limiting properties of Markov chains requires that the class of stochastic ma- 
trices (and hence the class of stationary Markov chains) be divided into four mutually 
exclusive categories. 


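(1) $\mathbf{P}$ is irreducible with $\lim_{k\to\infty} \mathbf{P}^k$ existing (i.e., $\mathbf{P}$ is primitive). 
(2) $\mathbf{P}$ is irreducible with $\lim_{k\to\infty} \mathbf{P}^k$ not existing (i.e., $\mathbf{P}$ is imprimitive). 
(3) $\mathbf{P}$ is reducible with $\lim_{k\to\infty} \mathbf{P}^k$ existing. 
(4) $\mathbf{P}$ is reducible with $\lim_{k\to\infty} \mathbf{P}^k$ not existing. 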


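In case (1) (an aperiodic chain), $\lim_{k\to\infty} \mathbf{P}^k$ can be easily evaluated. The Perron vector 
for $\mathbf{P}$ is $\mathbf{e}/n$ (the uniform distribution vector), so if $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_n)^T$ is the Perron 
vector for $\mathbf{P}^T$ (i.e., $\boldsymbol{\pi}^T\mathbf{P} = \boldsymbol{\pi}^T$), then, by (15.2.4), 

$$\lim_{k\to\infty} \mathbf{P}^k = \frac{(\mathbf{e}/n)\boldsymbol{\pi}^T}{\boldsymbol{\pi}^T(\mathbf{e}/n)} = \frac{\mathbf{e}\boldsymbol{\pi}^T}{\boldsymbol{\pi}^T\mathbf{e}} = \mathbf{e}\boldsymbol{\pi}^T = \begin{pmatrix} \pi_1 & \pi_2 & \cdots & \pi_n \\ \pi_1 & \pi_2 & \cdots & \pi_n \\ \vdots & \vdots & & \vdots \\ \pi_1 & \pi_2 & \cdots & \pi_n \end{pmatrix} > \mathbf{0}. \tag{15.3.3}$$

Therefore, if $\mathbf{P}$ is primitive, then a limiting probability distribution exists and is given by 

$$\lim_{k\to\infty} \mathbf{p}^T(k) = \lim_{k\to\infty} \mathbf{p}^T(0)\mathbf{P}^k = \mathbf{p}^T(0)\mathbf{e}\boldsymbol{\pi}^T = \boldsymbol{\pi}^T. \tag{15.3.4}$$

Notice that because $\sum_k p_k(0) = 1$, the term $\mathbf{p}^T(0)\mathbf{e}$ drops away, so the value of the limit 
is independent of the value of the initial distribution $\mathbf{p}^T(0)$, which isn’t too surprising. 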


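In case (2), where $\mathbf{P}$ is irreducible but imprimitive, (15.2.4) ensures that $\lim_{k\to\infty} \mathbf{P}^k$ 
cannot exist, and hence $\lim_{k\to\infty} \mathbf{p}^T(k)$ cannot exist (otherwise taking $\mathbf{p}^T(0) = \mathbf{e}_i^T$ for 
each $i$ would ensure that $\mathbf{P}^k$ has a limit). However, the results on page 173 ensure that the 
eigenvalues of $\mathbf{P}$ lying on the unit circle are each simple, so, by (15.1.18), $\mathbf{P}$ is Cesaro 
summable to the spectral projector $\mathbf{G}$ associated with the eigenvalue $\lambda = 1$. By recalling 
(15.1.12) and using the fact that $\mathbf{e}/n$ is the Perron vector for $\mathbf{P}$, it follows that if 
$\boldsymbol{\pi}^T = (\pi_1, \pi_2, \ldots, \pi_n)$ is the left-hand Perron vector, then 

$$\lim_{k\to\infty} \frac{\mathbf{I} + \mathbf{P} + \cdots + \mathbf{P}^{k-1}}{k} = \frac{(\mathbf{e}/n)\boldsymbol{\pi}^T}{\boldsymbol{\pi}^T(\mathbf{e}/n)} = \frac{\mathbf{e}\boldsymbol{\pi}^T}{\boldsymbol{\pi}^T\mathbf{e}} = \mathbf{e}\boldsymbol{\pi}^T = \begin{pmatrix} \pi_1 & \pi_2 & \cdots & \pi_n \\ \pi_1 & \pi_2 & \cdots & \pi_n \\ \vdots & \vdots & & \vdots \\ \pi_1 & \pi_2 & \cdots & \pi_n \end{pmatrix},$$

which is exactly the same form as the limit (15.3.3) for the primitive case. Consequently, 
the $k$th step distributions have a Cesaro limit given by 

$$\lim_{k\to\infty} \left[\frac{\mathbf{p}^T(0) + \mathbf{p}^T(1) + \cdots + \mathbf{p}^T(k-1)}{k}\right] = \lim_{k\to\infty} \mathbf{p}^T(0) \left[\frac{\mathbf{I} + \mathbf{P} + \cdots + \mathbf{P}^{k-1}}{k}\right] = \mathbf{p}^T(0)\mathbf{e}\boldsymbol{\pi}^T = \boldsymbol{\pi}^T,$$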


and, just as in the primitive case (15.3.4), this Cesaro limit is independent of the initial 
distribution. To interpret the meaning of this Cesaro limit, focus on one state, say $S_j$, and 
let $\{Z_k\}_{k=0}^{\infty}$ be random variables that count the number of visits to $S_j$ by setting 

$$Z_0 = \begin{cases} 1 & \text{if the chain starts in } S_j, \\ 0 & \text{otherwise,} \end{cases}$$

and for $i \geq 1$, 

$$Z_i = \begin{cases} 1 & \text{if the chain is in } S_j \text{ after the } i\text{th move,} \\ 0 & \text{otherwise.} \end{cases}$$

Notice that $Z_0 + Z_1 + \cdots + Z_{k-1}$ counts the number of visits to $S_j$ before the $k$th move, 
so $(Z_0 + Z_1 + \cdots + Z_{k-1})/k$ represents the fraction of times that $S_j$ is hit before the $k$th 
move. The expected (or mean) value of each $Z_i$ is 

$$E[Z_i] = 1 \cdot P(Z_i = 1) + 0 \cdot P(Z_i = 0) = P(Z_i = 1) = p_j(i).$$

Since expectation is linear, the expected fraction of times that $S_j$ is hit before move $k$ is 

$$\begin{aligned} E\left[\frac{Z_0 + Z_1 + \cdots + Z_{k-1}}{k}\right] &= \frac{E[Z_0] + E[Z_1] + \cdots + E[Z_{k-1}]}{k} \\ &= \frac{p_j(0) + p_j(1) + \cdots + p_j(k-1)}{k} = \left[\frac{\mathbf{p}^T(0) + \mathbf{p}^T(1) + \cdots + \mathbf{p}^T(k-1)}{k}\right]_j \to \pi_j. \end{aligned}$$

In other words, the long-run fraction of time that the chain spends in $S_j$ is $\pi_j$, which is 
the $j$th component of the Cesaro limit or, equivalently, the $j$th component of the left-hand 
Perron vector for $\mathbf{P}$. When $\lim_{k\to\infty} \mathbf{p}^T(k)$ exists, it is easily argued that 

$$\lim_{k\to\infty} \mathbf{p}^T(k) = \lim_{k\to\infty} \frac{\mathbf{p}^T(0) + \mathbf{p}^T(1) + \cdots + \mathbf{p}^T(k-1)}{k},$$


so the interpretation of the limiting distribution $\lim_{k\to\infty} \mathbf{p}^T(k)$ for the primitive case is 
exactly the same as the interpretation of the Cesaro limit in the imprimitive case. Below is 
a summary of irreducible chains. 

Irreducible Markov Chains 

Let $\mathbf{P}$ be the transition probability matrix for an irreducible Markov chain on 
states $\{S_1, S_2, \ldots, S_n\}$, and let $\boldsymbol{\pi}^T$ be the left-hand Perron vector for $\mathbf{P}$ (i.e., 
$\boldsymbol{\pi}^T\mathbf{P} = \boldsymbol{\pi}^T$, $\|\boldsymbol{\pi}\|_1 = 1$). The following hold for every initial distribution $\mathbf{p}^T(0)$. 

• The $k$th step transition matrix is $\mathbf{P}^k$. In other words, the $(i, j)$-entry in $\mathbf{P}^k$ is the 
probability of moving from $S_i$ to $S_j$ in exactly $k$ steps. 

• The $k$th step distribution vector is given by $\mathbf{p}^T(k) = \mathbf{p}^T(0)\mathbf{P}^k$. 

• If $\mathbf{P}$ is primitive (so the chain is aperiodic), and if $\mathbf{e}$ is the column of all 1’s, then 

$$\lim_{k\to\infty} \mathbf{P}^k = \mathbf{e}\boldsymbol{\pi}^T \quad\text{and}\quad \lim_{k\to\infty} \mathbf{p}^T(k) = \boldsymbol{\pi}^T.$$

• If $\mathbf{P}$ is imprimitive (so the chain is periodic), then 

$$\lim_{k\to\infty} \frac{\mathbf{I} + \mathbf{P} + \cdots + \mathbf{P}^{k-1}}{k} = \mathbf{e}\boldsymbol{\pi}^T \quad\text{and}\quad \lim_{k\to\infty} \frac{\mathbf{p}^T(0) + \mathbf{p}^T(1) + \cdots + \mathbf{p}^T(k-1)}{k} = \boldsymbol{\pi}^T.$$

• Regardless of whether $\mathbf{P}$ is primitive or imprimitive, the $j$th component $\pi_j$ of 
$\boldsymbol{\pi}^T$ represents the long-run fraction of time that the chain is in $S_j$. 

• The vector $\boldsymbol{\pi}^T$ is the unique stationary distribution vector for the chain because 
it is the unique probability distribution vector satisfying $\boldsymbol{\pi}^T\mathbf{P} = \boldsymbol{\pi}^T$. 
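
The two limiting statements in the summary can be contrasted numerically. In the sketch below, the primitive hyperlink matrix $\mathbf{H}$ from Figure 15.2 is iterated until the distribution settles, while for a two-state periodic chain only the Cesaro average settles. The stationary vector $(1/4, 3/8, 3/8)$ for $\mathbf{H}$ was obtained by solving $\boldsymbol{\pi}^T\mathbf{H} = \boldsymbol{\pi}^T$ by hand; the function names and iteration counts are assumptions for illustration.

```python
def step(p, P):
    """One transition: the row vector p^T P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]

def cesaro_average(p0, P, k):
    """Average of the distributions p(0), p(1), ..., p(k-1)."""
    total = [0.0] * len(p0)
    p = list(p0)
    for _ in range(k):
        total = [t + x for t, x in zip(total, p)]
        p = step(p, P)
    return [t / k for t in total]

H = [[0, 1/2, 1/2], [1/3, 0, 2/3], [1/3, 2/3, 0]]   # primitive (aperiodic chain)
S = [[0, 1], [1, 0]]                                 # imprimitive (period 2)

p = [0, 1, 0]
for _ in range(200):
    p = step(p, H)
print(p)                                  # close to [0.25, 0.375, 0.375]

print(step([1, 0], S), step(step([1, 0], S), S))   # the distribution keeps flipping
print(cesaro_average([1, 0], S, 1000))             # [0.5, 0.5]
```

The periodic chain spends half of its time in each state, which is exactly what the Cesaro limit $(1/2, 1/2)$ records even though $\mathbf{p}^T(k)$ itself never converges.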


Reducible Markov Chains 


The Perron—Frobenius theorem is not directly applicable to reducible chains (chains for 
which P is a reducible matrix), so the strategy for analyzing reducible chains is to deflate 
the situation, as much as possible, back to the irreducible case. If P is reducible, then, by 
definition, there is a permutation matrix Q and square matrices X and Z such that 

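$$\mathbf{Q}^T\mathbf{P}\mathbf{Q} = \begin{pmatrix} \mathbf{X} & \mathbf{Y} \\ \mathbf{0} & \mathbf{Z} \end{pmatrix}.$$

For convenience, denote this by writing $\mathbf{P} \sim \begin{pmatrix} \mathbf{X} & \mathbf{Y} \\ \mathbf{0} & \mathbf{Z} \end{pmatrix}$. 
If $\mathbf{X}$ or $\mathbf{Z}$ is reducible, then another symmetric permutation can be performed to produce 

$$\begin{pmatrix} \mathbf{X} & \mathbf{Y} \\ \mathbf{0} & \mathbf{Z} \end{pmatrix} \sim \begin{pmatrix} \mathbf{R} & \mathbf{S} & \mathbf{T} \\ \mathbf{0} & \mathbf{U} & \mathbf{V} \\ \mathbf{0} & \mathbf{0} & \mathbf{W} \end{pmatrix}, \quad\text{where } \mathbf{R}, \mathbf{U}, \text{ and } \mathbf{W} \text{ are square.}$$

Repeating this process eventually yields 

$$\mathbf{P} \sim \begin{pmatrix} \mathbf{X}_{11} & \mathbf{X}_{12} & \cdots & \mathbf{X}_{1k} \\ \mathbf{0} & \mathbf{X}_{22} & \cdots & \mathbf{X}_{2k} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{X}_{kk} \end{pmatrix}, \quad\text{where each } \mathbf{X}_{ii} \text{ is irreducible or } \mathbf{X}_{ii} = [0]_{1\times 1}.$$

Finally, if there exist rows having nonzero entries only in diagonal blocks, then symmetrically 
permute all such rows to the bottom to produce 

$$\mathbf{P} \sim \begin{pmatrix} \mathbf{P}_{11} & \mathbf{P}_{12} & \cdots & \mathbf{P}_{1r} & \mathbf{P}_{1,r+1} & \mathbf{P}_{1,r+2} & \cdots & \mathbf{P}_{1m} \\ \mathbf{0} & \mathbf{P}_{22} & \cdots & \mathbf{P}_{2r} & \mathbf{P}_{2,r+1} & \mathbf{P}_{2,r+2} & \cdots & \mathbf{P}_{2m} \\ \vdots & & \ddots & \vdots & \vdots & \vdots & & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{P}_{rr} & \mathbf{P}_{r,r+1} & \mathbf{P}_{r,r+2} & \cdots & \mathbf{P}_{rm} \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{P}_{r+1,r+1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} & \mathbf{P}_{r+2,r+2} & \cdots & \mathbf{0} \\ \vdots & & & \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{P}_{mm} \end{pmatrix}, \tag{15.3.5}$$

where each $\mathbf{P}_{11}, \ldots, \mathbf{P}_{rr}$ is either irreducible or $[0]_{1\times 1}$, and $\mathbf{P}_{r+1,r+1}, \ldots, \mathbf{P}_{mm}$ are 
irreducible (they can’t be zero because each has row sums equal to 1). As mentioned on 
page 171, the effect of a symmetric permutation is simply to relabel nodes in $\mathcal{G}(\mathbf{P})$ or, 
equivalently, to reorder the states in the chain. When the states of a chain have been 
reordered so that $\mathbf{P}$ assumes the form on the right-hand side of (15.3.5), we say that $\mathbf{P}$ is in 
the canonical form for reducible matrices. 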


The results on page 173 guarantee that if an irreducible stochastic matrix $\mathbf{P}$ has $h$ 
eigenvalues on the unit circle, then these $h$ eigenvalues are the $h$th roots of unity, and each 
is a simple eigenvalue for $\mathbf{P}$. The same can’t be said for reducible stochastic matrices, but 
(15.3.5) leads to the next best result (the proof of which is in [127]). 


Unit Eigenvalues 
The unit eigenvalues are those eigenvalues that are on the unit circle. For every 
stochastic matrix $\mathbf{P}_{n\times n}$, the following statements are true. 

• Every unit eigenvalue of $\mathbf{P}$ is semisimple. 

• Every unit eigenvalue has form $\lambda = e^{2k\pi i/h}$ for some $k < h \leq n$. 

• In particular, $\rho(\mathbf{P}) = 1$ is always a semisimple eigenvalue of $\mathbf{P}$. 


The discussion on page 163 says that a matrix $\mathbf{A}_{n\times n}$ is Cesaro summable if and 
only if $\rho(\mathbf{A}) < 1$ or $\rho(\mathbf{A}) = 1$ with each eigenvalue on the unit circle being semisimple. 
Since the result above says that the latter holds for all stochastic matrices $\mathbf{P}$, we have the 
following powerful realization concerning all stochastic matrices. 

All Stochastic Matrices Are Summable 
Every stochastic matrix $\mathbf{P}$ is Cesaro summable in the sense that 

$$\lim_{k\to\infty} \frac{\mathbf{I} + \mathbf{P} + \cdots + \mathbf{P}^{k-1}}{k} = \mathbf{G}$$

always exists and, as discussed on page 163, the value of the limit is the spectral 
projector $\mathbf{G}$ onto $N(\mathbf{I} - \mathbf{P})$ along $R(\mathbf{I} - \mathbf{P})$. 

The structure and interpretation of the Cesaro limit when $\mathbf{P}$ is an irreducible stochastic 
matrix was developed on page 181, so to complete the picture all that remains is to 
analyze the nature of $\lim_{k\to\infty} (\mathbf{I} + \mathbf{P} + \cdots + \mathbf{P}^{k-1})/k$ for the reducible case. 


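Suppose that $\mathbf{P} = \begin{pmatrix} \mathbf{T}_{11} & \mathbf{T}_{12} \\ \mathbf{0} & \mathbf{T}_{22} \end{pmatrix}$ is a reducible stochastic matrix that is in the canonical 
form (15.3.5), where 

$$\mathbf{T}_{11} = \begin{pmatrix} \mathbf{P}_{11} & \cdots & \mathbf{P}_{1r} \\ & \ddots & \vdots \\ & & \mathbf{P}_{rr} \end{pmatrix}, \quad \mathbf{T}_{12} = \begin{pmatrix} \mathbf{P}_{1,r+1} & \cdots & \mathbf{P}_{1m} \\ \vdots & & \vdots \\ \mathbf{P}_{r,r+1} & \cdots & \mathbf{P}_{rm} \end{pmatrix}, \quad \mathbf{T}_{22} = \begin{pmatrix} \mathbf{P}_{r+1,r+1} & & \\ & \ddots & \\ & & \mathbf{P}_{mm} \end{pmatrix}.$$

Because each row in $\mathbf{T}_{11}$ has a nonzero off-diagonal block, it follows that $\rho(\mathbf{P}_{kk}) < 1$ for 
each $k = 1, 2, \ldots, r$. Consequently, $\rho(\mathbf{T}_{11}) < 1$, and 

$$\lim_{k\to\infty} \frac{\mathbf{I} + \mathbf{T}_{11} + \cdots + \mathbf{T}_{11}^{k-1}}{k} = \lim_{k\to\infty} \mathbf{T}_{11}^{k} = \mathbf{0}.$$

Furthermore, $\mathbf{P}_{r+1,r+1}, \ldots, \mathbf{P}_{mm}$ are each irreducible stochastic matrices, so if $\boldsymbol{\pi}_j^T$ is the 
left-hand Perron vector for $\mathbf{P}_{jj}$, $r + 1 \leq j \leq m$, then (15.1.12) combined with (15.1.18) 
yields 

$$\lim_{k\to\infty} \frac{\mathbf{I} + \mathbf{T}_{22} + \cdots + \mathbf{T}_{22}^{k-1}}{k} = \begin{pmatrix} \mathbf{e}\boldsymbol{\pi}_{r+1}^T & & \\ & \ddots & \\ & & \mathbf{e}\boldsymbol{\pi}_m^T \end{pmatrix} = \mathbf{E}.$$

It’s clear from (15.2.4) that $\lim_{k\to\infty} \mathbf{T}_{22}^k$ exists if and only if $\mathbf{P}_{r+1,r+1}, \ldots, \mathbf{P}_{mm}$ are 
each primitive, in which case $\lim_{k\to\infty} \mathbf{T}_{22}^k = \mathbf{E}$. Therefore, the limits, be they Cesaro or 
ordinary (if it exists), all have the form 

$$\lim_{k\to\infty} \frac{\mathbf{I} + \mathbf{P} + \cdots + \mathbf{P}^{k-1}}{k} = \begin{pmatrix} \mathbf{0} & \mathbf{Z} \\ \mathbf{0} & \mathbf{E} \end{pmatrix} = \mathbf{G} = \lim_{k\to\infty} \mathbf{P}^k \ \text{(when it exists)}.$$

To determine the precise nature of $\mathbf{Z}$, use the fact that $R(\mathbf{G}) = N(\mathbf{I} - \mathbf{P})$ (because $\mathbf{G}$ is 
the projector onto $N(\mathbf{I} - \mathbf{P})$ along $R(\mathbf{I} - \mathbf{P})$) to write 

$$(\mathbf{I} - \mathbf{P})\mathbf{G} = \mathbf{0} \implies \begin{pmatrix} \mathbf{I} - \mathbf{T}_{11} & -\mathbf{T}_{12} \\ \mathbf{0} & \mathbf{I} - \mathbf{T}_{22} \end{pmatrix} \begin{pmatrix} \mathbf{0} & \mathbf{Z} \\ \mathbf{0} & \mathbf{E} \end{pmatrix} = \mathbf{0} \implies (\mathbf{I} - \mathbf{T}_{11})\mathbf{Z} = \mathbf{T}_{12}\mathbf{E}.$$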

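Since $\mathbf{I} - \mathbf{T}_{11}$ is nonsingular (because $\rho(\mathbf{T}_{11}) < 1$), it follows that 

$$\mathbf{Z} = (\mathbf{I} - \mathbf{T}_{11})^{-1}\mathbf{T}_{12}\mathbf{E},$$

and thus the following results concerning limits of reducible chains are produced. 

Reducible Markov Chains 

If the states in a reducible Markov chain have been ordered to make the transition 
matrix assume the canonical form 

$$\mathbf{P} = \begin{pmatrix} \mathbf{T}_{11} & \mathbf{T}_{12} \\ \mathbf{0} & \mathbf{T}_{22} \end{pmatrix}$$

that is described in (15.3.5), and if $\boldsymbol{\pi}_j^T$ is the left-hand Perron vector for $\mathbf{P}_{jj}$ 
($r + 1 \leq j \leq m$), then $\mathbf{I} - \mathbf{T}_{11}$ is nonsingular, and 

$$\lim_{k\to\infty} \frac{\mathbf{I} + \mathbf{P} + \cdots + \mathbf{P}^{k-1}}{k} = \begin{pmatrix} \mathbf{0} & (\mathbf{I} - \mathbf{T}_{11})^{-1}\mathbf{T}_{12}\mathbf{E} \\ \mathbf{0} & \mathbf{E} \end{pmatrix}, \quad\text{where } \mathbf{E} = \begin{pmatrix} \mathbf{e}\boldsymbol{\pi}_{r+1}^T & & \\ & \ddots & \\ & & \mathbf{e}\boldsymbol{\pi}_m^T \end{pmatrix}.$$

Furthermore, $\lim_{k\to\infty} \mathbf{P}^k$ exists if and only if the stochastic matrices 
$\mathbf{P}_{r+1,r+1}, \ldots, \mathbf{P}_{mm}$ in (15.3.5) are each primitive, in which case 

$$\lim_{k\to\infty} \mathbf{P}^k = \begin{pmatrix} \mathbf{0} & (\mathbf{I} - \mathbf{T}_{11})^{-1}\mathbf{T}_{12}\mathbf{E} \\ \mathbf{0} & \mathbf{E} \end{pmatrix}. \tag{15.3.6}$$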
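
A tiny reducible chain makes (15.3.6) easy to check. In the sketch below the first state is transient and the last two states form one primitive ergodic class, so $\mathbf{T}_{11}$ is $1 \times 1$ and $\mathbf{T}_{22}$ is $2 \times 2$. The transition matrix, the hand-computed Perron vector $(1/3, 2/3)$ of $\mathbf{T}_{22}$, and the helper names are all assumptions chosen for illustration.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matrix_power(P, k):
    """P^k by repeated multiplication."""
    n = len(P)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, P)
    return result

# Canonical form: state 0 is transient, states 1 and 2 form an ergodic class.
P = [[0.5, 0.25, 0.25],
     [0.0, 0.5, 0.5],
     [0.0, 0.25, 0.75]]

pi = [1 / 3, 2 / 3]                  # left-hand Perron vector of T22 (sums to 1)
E = [pi, pi]                         # E = e pi^T
t11 = P[0][0]                        # T11 is the 1 x 1 block [0.5]
T12 = P[0][1:]                       # T12 is the 1 x 2 block [0.25, 0.25]
# Z = (I - T11)^{-1} T12 E, which reduces to a scalar division here.
Z = [sum(T12[i] * E[i][j] for i in range(2)) / (1 - t11) for j in range(2)]
predicted = [[0.0] + Z] + [[0.0] + row for row in E]

print(matrix_power(P, 100))          # numerically equal to `predicted`
print(predicted)                     # every row is approximately (0, 1/3, 2/3)
```

Because there is only one ergodic class, every row of the limit coincides with its stationary vector padded by a zero for the transient state.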


Transient and Ergodic Classes 


When the states of a chain are reordered so that $\mathbf{P}$ is in canonical form (15.3.5), the subset 
of states corresponding to $\mathbf{P}_{kk}$ for $1 \leq k \leq r$ is called the $k$th transient class because once 
left, a transient class can’t be reentered. The subset of states corresponding to $\mathbf{P}_{r+j,r+j}$ 
for $j \geq 1$ is called the $j$th ergodic class. Each ergodic class is an irreducible Markov chain 
unto itself that is imbedded in the larger reducible chain. From now on, we will assume 
that the states in reducible chains have been ordered so that $\mathbf{P}$ is in canonical form (15.3.5). 

Every reducible chain eventually enters one of the ergodic classes, but what happens 
after that depends on whether or not the ergodic class is primitive. If $\mathbf{P}_{r+j,r+j}$ is primitive, 
then the chain settles down to a steady state defined by the left-hand Perron vector of 
$\mathbf{P}_{r+j,r+j}$, but if $\mathbf{P}_{r+j,r+j}$ is imprimitive, then the process will oscillate in the $j$th ergodic 
class forever. There is not much more that can be said about the limit, but there are still 
important questions concerning which ergodic class the chain will end up in and how long 
it takes to get there. This time the answer depends on where the chain starts, i.e., on the 
initial distribution. 


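For convenience, let $\mathcal{T}_i$ denote the $i$th transient class, and let $\mathcal{E}_j$ be the $j$th ergodic 
class. Suppose that the chain starts in a particular transient state, say we start in the $p$th 
state of $\mathcal{T}_i$. Since the question at hand concerns only which ergodic class is hit but not what 
happens after it’s entered, we might as well convert every state in each ergodic class into 
a trap by setting $\mathbf{P}_{r+j,r+j} = \mathbf{I}$ for each $j \geq 1$ in (15.3.5). The transition matrix for this 
modified chain is $\widetilde{\mathbf{P}} = \begin{pmatrix} \mathbf{T}_{11} & \mathbf{T}_{12} \\ \mathbf{0} & \mathbf{I} \end{pmatrix}$, and it follows from (15.3.6) that $\lim_{k\to\infty} \widetilde{\mathbf{P}}^k$ exists 
and has the form 

$$\lim_{k\to\infty} \widetilde{\mathbf{P}}^k = \begin{pmatrix} \mathbf{0} & (\mathbf{I} - \mathbf{T}_{11})^{-1}\mathbf{T}_{12} \\ \mathbf{0} & \mathbf{I} \end{pmatrix} = \begin{pmatrix} \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{L}_{11} & \mathbf{L}_{12} & \cdots & \mathbf{L}_{1s} \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{L}_{21} & \mathbf{L}_{22} & \cdots & \mathbf{L}_{2s} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{L}_{r1} & \mathbf{L}_{r2} & \cdots & \mathbf{L}_{rs} \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{I} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} & \mathbf{I} & \cdots & \mathbf{0} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{0} & \mathbf{0} & \mathbf{0} & \cdots & \mathbf{I} \end{pmatrix}.$$

Consequently, the $(p, q)$-entry in block $\mathbf{L}_{ij}$ represents the probability of eventually hitting 
the $q$th state in $\mathcal{E}_j$ given that we start from the $p$th state in $\mathcal{T}_i$. Therefore, if $\mathbf{e}$ is the vector 
of all 1’s, then the probability of eventually entering somewhere in $\mathcal{E}_j$ is given by 

$$P(\text{absorption into } \mathcal{E}_j \mid \text{start in } p\text{th state of } \mathcal{T}_i) = \sum_k [\mathbf{L}_{ij}]_{pk} = [\mathbf{L}_{ij}\mathbf{e}]_p.$$

If $\mathbf{p}_i^T(0)$ is an initial distribution for starting in the various states of $\mathcal{T}_i$, then 

$$P(\text{absorption into } \mathcal{E}_j \mid \mathbf{p}_i^T(0)) = \mathbf{p}_i^T(0)\mathbf{L}_{ij}\mathbf{e}.$$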


The expected number of steps required to first hit an ergodic state is determined as 
follows. Count the number of times the chain is in transient state $S_j$ given that it starts in 
transient state $S_i$ by reapplying the argument given on page 180. That is, given that the 
chain starts in $S_i$, let 

$$Z_0 = \begin{cases} 1 & \text{if } S_i = S_j, \\ 0 & \text{otherwise,} \end{cases} \quad\text{and}\quad Z_k = \begin{cases} 1 & \text{if the chain is in } S_j \text{ after step } k, \\ 0 & \text{otherwise.} \end{cases}$$

Since 

$$E[Z_k] = 1 \cdot P(Z_k = 1) + 0 \cdot P(Z_k = 0) = P(Z_k = 1) = [\mathbf{T}_{11}^k]_{ij},$$

and since $\sum_{k=0}^{\infty} Z_k$ is the total number of times the chain is in $S_j$, we have 

$$E[\#\text{ times in } S_j \mid \text{start in } S_i] = E\left[\sum_{k=0}^{\infty} Z_k\right] = \sum_{k=0}^{\infty} E[Z_k] = \sum_{k=0}^{\infty} [\mathbf{T}_{11}^k]_{ij} = \left[(\mathbf{I} - \mathbf{T}_{11})^{-1}\right]_{ij} \quad (\text{because } \rho(\mathbf{T}_{11}) < 1).$$

Summing this over all transient states produces the expected number of times the chain 
is in some transient state, which is the same as the expected number of times before first 
hitting an ergodic state. In other words, 

$$E[\#\text{ steps until absorption} \mid \text{start in } i\text{th transient state}] = \left[(\mathbf{I} - \mathbf{T}_{11})^{-1}\mathbf{e}\right]_i.$$
It’s often the case in practical applications that there is only one transient class, and 
the ergodic classes are just single absorbing states (states such that once they are entered, 
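they are never left). If the single transient class contains $r$ states, and if there are $s$ absorbing
states, then the canonical form for the transition matrix is

$$P = \left(\begin{array}{ccc|ccc} p_{11} & \cdots & p_{1r} & p_{1,r+1} & \cdots & p_{1,r+s} \\ \vdots & & \vdots & \vdots & & \vdots \\ p_{r1} & \cdots & p_{rr} & p_{r,r+1} & \cdots & p_{r,r+s} \\ \hline 0 & \cdots & 0 & 1 & \cdots & 0 \\ \vdots & & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & 0 & \cdots & 1 \end{array}\right). \qquad (15.3.7)$$

In this case, $L_{ij} = [(I - T_{11})^{-1}T_{12}]_{ij}$, and the earlier development specializes to say that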
every absorbing chain must eventually reach one of its absorbing states. The absorption 
probabilities and absorption times are included in the following summary. 


Absorption Probabilities and Absorption Times 


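For a reducible chain whose transition matrix $P = \begin{pmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{pmatrix}$ is in the canonical
form (15.3.5), let $\mathcal{T}_i$ and $\mathcal{E}_j$ be the $i$th and $j$th transient and ergodic classes,
respectively, and let $p_i^T(0)$ be an initial distribution for starting in the various
states of $\mathcal{T}_i$. If $(I - T_{11})^{-1}T_{12}$ is partitioned as

$$(I - T_{11})^{-1}T_{12} = \begin{pmatrix} L_{11} & L_{12} & \cdots & L_{1s} \\ L_{21} & L_{22} & \cdots & L_{2s} \\ \vdots & \vdots & & \vdots \\ L_{r1} & L_{r2} & \cdots & L_{rs} \end{pmatrix},$$

then

• $P(\text{absorption into } \mathcal{E}_j \mid p_i^T(0)) = p_i^T(0)L_{ij}e$,

• $P(\text{absorption into } \mathcal{E}_j \mid \text{start in } p\text{th state of } \mathcal{T}_i) = \sum_k [L_{ij}]_{pk} = [L_{ij}e]_p$,

• $E[\#\text{ steps until absorption} \mid \text{start in } i\text{th transient state}] = [(I - T_{11})^{-1}e]_i$.

When there is only one transient class and each ergodic class is a single absorbing
state ($\mathcal{E}_j = S_{r+j}$), $P$ has the form (15.3.7). If $S_i$ and $S_j$ are transient states, then

• $P(\text{absorption into } S_{r+j} \mid \text{start in } S_i) = [(I - T_{11})^{-1}T_{12}]_{ij}$,

• $E[\#\text{ steps until absorption} \mid \text{start in } S_i] = [(I - T_{11})^{-1}e]_i$,

• $E[\#\text{ times in } S_j \mid \text{start in } S_i] = [(I - T_{11})^{-1}]_{ij}$.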
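

The three formulas for the single-transient-class case can be checked on a small chain. The following is a minimal sketch rather than anything taken from the text: the four-state random walk, the names `T11` and `T12`, and the helper `solve` are all assumptions chosen for illustration, and exact rational arithmetic is used so the results can be compared with a hand computation.

```python
from fractions import Fraction as F

def solve(M, B):
    """Solve M X = B exactly by Gauss-Jordan elimination (M square, nonsingular)."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if aug[i][c] != 0)  # first usable pivot row
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for i in range(n):
            if i != c and aug[i][c] != 0:
                f = aug[i][c]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]

# Hypothetical example: a walk on positions 0..3 in which 0 and 3 absorb and
# the walker moves left or right with probability 1/2 from positions 1 and 2.
T11 = [[F(0), F(1, 2)], [F(1, 2), F(0)]]   # transient -> transient block
T12 = [[F(1, 2), F(0)], [F(0), F(1, 2)]]   # transient -> absorbing block
r = len(T11)
identity = [[F(int(i == j)) for j in range(r)] for i in range(r)]
I_minus_T11 = [[identity[i][j] - T11[i][j] for j in range(r)] for i in range(r)]

visits = solve(I_minus_T11, identity)  # (I - T11)^{-1}: expected times in S_j from S_i
absorb = solve(I_minus_T11, T12)       # (I - T11)^{-1} T12: absorption probabilities
steps = [sum(row) for row in visits]   # (I - T11)^{-1} e: expected steps to absorption

print(absorb)
print(steps)
```

Solving the linear systems instead of forming the inverse explicitly keeps the sketch short; for this walk the absorption probabilities come out as 2/3 for the nearer absorbing state and 1/3 for the farther one.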


15.4 PERRON COMPLEMENTATION 


The theory of stochastic complementation in section 15.5 concerns the development of 
methods that allow the stationary distribution of a large irreducible Markov chain to be 
obtained by gluing together stationary distributions of smaller chains. The concepts are 
based on the theory of Perron complementation, which describes how the Perron vector of 
a large irreducible matrix can be expressed in terms of Perron vectors of smaller matrices.


Perron Complements


Partition an irreducible $A_{n\times n} \geq 0$ with spectral radius $\rho(A) = r$ as

$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1k} \\ A_{21} & A_{22} & \cdots & A_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ A_{k1} & A_{k2} & \cdots & A_{kk} \end{pmatrix}, \qquad (15.4.1)$$

where all diagonal blocks are square. The Perron complement of the $i$th diagonal
block $A_{ii}$ is defined to be the matrix

$$P_i = A_{ii} + A_{i*}(rI - A_i^*)^{-1}A_{*i}, \qquad (15.4.2)$$

where $A_{i*}$ and $A_{*i}$ are, respectively, the $i$th row and the $i$th column of blocks
with $A_{ii}$ removed, and $A_i^*$ is the principal submatrix of $A$ obtained by deleting
the $i$th row and $i$th column of blocks. The nonsingularity of $rI - A_i^*$ is discussed
on page 188.


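For example, if $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \geq 0$ is irreducible with $\rho(A) = r$, then the two
Perron complements are

$$P_1 = A_{11} + A_{12}(rI - A_{22})^{-1}A_{21} \quad \text{and} \quad P_2 = A_{22} + A_{21}(rI - A_{11})^{-1}A_{12}.$$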


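If $A$ is partitioned as $A = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$, then there are three Perron complements,
and the second one is

$$P_2 = A_{22} + \begin{pmatrix} A_{21} & A_{23} \end{pmatrix} \begin{pmatrix} rI - A_{11} & -A_{13} \\ -A_{31} & rI - A_{33} \end{pmatrix}^{-1} \begin{pmatrix} A_{12} \\ A_{32} \end{pmatrix},$$

with the other two complements, $P_1$ and $P_3$, being similarly formed.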


For $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, the more familiar Schur complements are defined [127] to
be

$$A_{11} - A_{12}A_{22}^{-1}A_{21} \quad \text{and} \quad A_{22} - A_{21}A_{11}^{-1}A_{12},$$

so, while they are not the same, the Perron complements are related to the Schur complements
by the following construction.


1. Shift $A$ by $rI$ by constructing $A - rI$.
2. Form Schur complements $C_i$.
3. Shift the results back by constructing $rI + C_i$.
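

The three-step construction can be verified numerically. The sketch below is illustrative only: the $3 \times 3$ matrix is an assumed example whose rows each sum to 7 (so that $r = 7$ is known without an eigenvalue computation), and the helper names `solve`, `schur_complement`, and `perron_complement` are hypothetical.

```python
from fractions import Fraction as F

def solve(M, B):
    """Solve M X = B exactly by Gauss-Jordan elimination (M square, nonsingular)."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if aug[i][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for i in range(n):
            if i != c and aug[i][c] != 0:
                f = aug[i][c]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]

def schur_complement(M, keep):
    """M_ii - M_i* (M_*)^{-1} M_*i, where `keep` lists the indices of block i."""
    drop = [i for i in range(len(M)) if i not in keep]
    X = solve([[M[i][j] for j in drop] for i in drop],
              [[M[i][j] for j in keep] for i in drop])
    return [[M[i][j] - sum(M[i][d] * X[t][c] for t, d in enumerate(drop))
             for c, j in enumerate(keep)] for i in keep]

def perron_complement(A, r, keep):
    """A_ii + A_i* (rI - A_*)^{-1} A_*i, as in (15.4.2)."""
    drop = [i for i in range(len(A)) if i not in keep]
    shifted = [[(r if i == j else 0) - A[i][j] for j in drop] for i in drop]
    X = solve(shifted, [[A[i][j] for j in keep] for i in drop])
    return [[A[i][j] + sum(A[i][d] * X[t][c] for t, d in enumerate(drop))
             for c, j in enumerate(keep)] for i in keep]

# Assumed example: positive matrix with every row summing to 7, so rho(A) = 7.
A = [[F(x) for x in row] for row in [[1, 2, 4], [3, 3, 1], [2, 4, 1]]]
r = 7
n = len(A)
A_minus_rI = [[A[i][j] - (r if i == j else 0) for j in range(n)] for i in range(n)]  # step 1

def via_schur(keep):
    C = schur_complement(A_minus_rI, keep)                      # step 2
    return [[C[a][b] + (r if a == b else 0) for b in range(len(keep))]
            for a in range(len(keep))]                          # step 3

P1 = perron_complement(A, r, [0, 1])
P2 = perron_complement(A, r, [2])
print(P1 == via_schur([0, 1]), P2 == via_schur([2]))
```

For this matrix each row of `P1` again sums to 7, which agrees with the claim that Perron complements inherit the spectral radius of the parent matrix.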


This is not the only reason for the terminology “Perron complement”; the other reasons
will become evident as other developments unfold. The salient feature of all Perron
complements is that they inherit “Perron properties” from their parent matrix in the sense
that if $A$ is nonnegative and irreducible, then so is each Perron complement $P_i$ that is
derived from $A$. Furthermore, if $\rho(A) = r$, then $\rho(P_i) = r$ for each $i$. And, most importantly,
the Perron vectors of the $P_i$’s combine to form the Perron vector of the parent
matrix $A$. Before these things can be understood, some preliminary results are needed.
The first such result is the converse to part of the Perron-Frobenius Theorem on page 172.


Irreducibility Revisited 


$A_{n\times n} \geq 0$ is irreducible if and only if $A$ has a simple positive eigenvalue $\lambda > 0$
that is associated with a positive right-hand eigenvector $p > 0$ as well as a positive
left-hand eigenvector $q^T > 0$.


Proof. Suppose that $A$ has a simple eigenvalue $\lambda > 0$ with associated right-hand and left-hand
eigenvectors $p > 0$ and $q^T > 0$, respectively. If $D = \operatorname{diag}(p_1, p_2, \ldots, p_n)$, then

$$P = \frac{D^{-1}AD}{\lambda} \qquad (15.4.3)$$

is a stochastic matrix that is irreducible if and only if $A$ is irreducible. And 1 is a simple
eigenvalue of $P$ associated with the respective right-hand and left-hand eigenvectors
$D^{-1}p = e > 0$ and $q^TD > 0$. Consequently, $P$ is Cesàro summable to the spectral
projector $G$ onto $N(I - P)$ (page 182). The simplicity of $1 \in \sigma(P)$ means that

$$G = \frac{D^{-1}pq^TD}{q^Tp} > 0 \quad (\text{recall (15.1.12) on page 161}).$$

This ensures that $P$ (and hence $A$) is irreducible. Otherwise, $[P^k]_{ij} = 0$ for some $i \neq j$
and for all $k$, so

$$\left[\frac{I + P + \cdots + P^{k-1}}{k}\right]_{ij} = 0 \text{ for } k = 1, 2, \ldots \implies G_{ij} = 0. \qquad ∎$$

In order for a Perron complement $P_i = A_{ii} + A_{i*}(rI - A_i^*)^{-1}A_{*i}$ to be well
defined, the existence of $(rI - A_i^*)^{-1}$ must be ensured. This, along with the fact that
$(rI - A_i^*)^{-1} \geq 0$, is the point of the next theorem.


Principal Submatrices 


Let $A_{n\times n} \geq 0$ be irreducible with $\rho(A) = r$, and partition $A$ as in (15.4.1).
If $A_i^*$ is the principal submatrix of $A$ obtained by deleting the $i$th row and $i$th
column of blocks, then

$$\rho(A_i^*) < r, \qquad (15.4.4)$$

$$(rI - A_i^*) \text{ is nonsingular, and } (rI - A_i^*)^{-1} \geq 0. \qquad (15.4.5)$$

In other words, $rI - A_i^*$ is an M-matrix as described on page 166.


Proof. To prove that $\rho(A_i^*) < r$, suppose to the contrary that $r \leq \rho(A_i^*)$. If $Q$ is the
permutation matrix such that

$$Q^TAQ = \begin{pmatrix} A_i^* & A_{*i} \\ A_{i*} & A_{ii} \end{pmatrix} = \widetilde{A}, \quad \text{and if} \quad B = \begin{pmatrix} A_i^* & 0 \\ 0 & 0 \end{pmatrix}, \qquad (15.4.6)$$

then $\rho(\widetilde{A}) = \rho(A) = r \leq \rho(A_i^*) = \rho(B)$. Furthermore $\widetilde{A} \geq B \geq 0$ ensures that
$\rho(\widetilde{A}) \geq \rho(B)$ [127, pg 619], so $r = \rho(B) = \rho(A_i^*)$. But this is impossible because Perron’s
theorem for nonnegative matrices (page 168) guarantees the existence of a vector $v \geq 0$,
$v \neq 0$, such that $A_i^*v = rv$, so $z = \begin{pmatrix} v \\ 0 \end{pmatrix}$ is a nonnegative nonzero vector such that
$Bz = rz$. It follows from $\widetilde{A} \geq B$ that $\widetilde{A}z \geq Bz = rz$, and it’s a straightforward exercise
[127, pg 674] to show that this implies $\widetilde{A}z = rz$ with $z > 0$, which is a contradiction.
Thus $\rho(A_i^*) < r$. The fact that $(rI - A_i^*)$ is nonsingular and $(rI - A_i^*)^{-1} \geq 0$ can be
deduced from the Neumann series expansion (15.1.16) on page 162. ∎


As discussed below, Perron complements inherit most of the useful properties that 
their parent matrix possesses. 


Inherited Perron Properties 
If $A_{n\times n} \geq 0$ is an irreducible matrix with $\rho(A) = r$ that is partitioned as in
(15.4.1), and if $P_i = A_{ii} + A_{i*}(rI - A_i^*)^{-1}A_{*i}$ is the $i$th Perron complement
as defined in (15.4.2), then

$$P_i \geq 0 \text{ for every } i, \qquad (15.4.7)$$

$$P_i \text{ is irreducible for every } i, \qquad (15.4.8)$$

$$\rho(P_i) = r \text{ for every } i. \qquad (15.4.9)$$


Proof. $P_i \geq 0$ because $(rI - A_i^*)^{-1} \geq 0$, and all of the other terms in $P_i$ are nonnegative.
To see that $P_i$ is irreducible, let $p = \begin{pmatrix} x \\ y \end{pmatrix}$ be the partitioned right-hand Perron vector for
the nonnegative irreducible matrix $\widetilde{A}$ in (15.4.6) so that $(rI - \widetilde{A})p = 0$. The lower part of

$$\begin{pmatrix} rI - A_i^* & -A_{*i} \\ -A_{i*} & rI - A_{ii} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \implies \begin{pmatrix} I & 0 \\ A_{i*}(rI - A_i^*)^{-1} & I \end{pmatrix} \begin{pmatrix} rI - A_i^* & -A_{*i} \\ -A_{i*} & rI - A_{ii} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

yields

$$(rI - P_i)y = 0, \qquad (15.4.10)$$

and thus $(r, y)$ is a right-hand eigenpair for $P_i$ with $y > 0$. A similar argument shows that
there is also a left-hand eigenpair $(r, z^T)$ for $P_i$ with $z^T > 0$. Furthermore, $r$ is a simple
eigenvalue of $P_i$ because Perron-Frobenius ensures that $r$ is a simple eigenvalue of $A$, as
well as $\widetilde{A}$, so this together with

$$\begin{pmatrix} I & 0 \\ A_{i*}(rI - A_i^*)^{-1} & I \end{pmatrix} \begin{pmatrix} rI - A_i^* & -A_{*i} \\ -A_{i*} & rI - A_{ii} \end{pmatrix} \begin{pmatrix} I & (rI - A_i^*)^{-1}A_{*i} \\ 0 & I \end{pmatrix} = \begin{pmatrix} rI - A_i^* & 0 \\ 0 & rI - P_i \end{pmatrix}$$

and the fact that $(rI - A_i^*)$ is nonsingular produces

$$1 = \dim N(rI - \widetilde{A}) = \dim N(rI - A_i^*) + \dim N(rI - P_i) = \dim N(rI - P_i). \qquad (15.4.11)$$

Since $P_i$ can be transformed into a stochastic matrix without altering multiplicities as
described in (15.4.3), and since the spectral radius of a stochastic matrix is semisimple
(p. 182), it follows that $r$ is a semisimple eigenvalue for $P_i$. Hence (15.4.11) ensures that
$r$ is a simple eigenvalue for $P_i$. The irreducibility of $P_i$ is now a consequence of the
result on page 188. Finally, part of the Perron-Frobenius theorem (p. 172) states that a
nonnegative irreducible matrix can have no nonnegative eigenvectors other than multiples
of the positive Perron vector associated with the spectral radius. Therefore, since $(r, y)$ is
an eigenpair for $P_i$ with $y > 0$, it follows that $\rho(P_i) = r$, where $z_i = y/\|y\|_1$ is the
associated Perron vector. ∎


The above proof is more important than it might first appear to be because it reveals
a significant relationship between the Perron vector of $A$ and the Perron vector of $P_i$. If
the Perron vector for $A$ is partitioned conformably with the partition in (15.4.1) as

$$p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_k \end{pmatrix},$$

then the nature of the permutation in (15.4.6) makes it clear that $p_i = y$, where $y > 0$ is
the vector in (15.4.10). Consequently, the Perron vector for $P_i$ is

$$z_i = \frac{y}{\|y\|_1} = \frac{y}{e^Ty} = \frac{p_i}{e^Tp_i}$$

or, equivalently,

$$p_i = \xi_i z_i, \quad \text{where } \xi_i = e^Tp_i. \qquad (15.4.12)$$

In other words, the Perron vectors $z_i$ of smaller Perron complements can be glued together
to build the Perron vector of $A$ by writing

$$p = \begin{pmatrix} \xi_1 z_1 \\ \xi_2 z_2 \\ \vdots \\ \xi_k z_k \end{pmatrix}. \qquad (15.4.13)$$

This looks like a nice result until you realize that the glue is the set of scalars $\xi_i = e^Tp_i$,
so we are going in circles if we need to use the components of $p$ in order to compute the
components of $p$. Fortunately, there’s a clever way out of this dilemma by manufacturing
the glue from the Perron vector of a coupling matrix $C$, which is yet another matrix
that inherits its Perron properties from the parent matrix $A$. The following theorem brings
everything together.


The Coupling Theorem


Suppose $A_{n\times n} \geq 0$ is irreducible with $\rho(A) = r$ and is partitioned into $k$ levels
as in (15.4.1). Let $p$ and $z_i$ be the respective Perron vectors of $A$ and the Perron
complement $P_i$ defined in (15.4.2). The matrix

$$C = \begin{pmatrix} e^TA_{11}z_1 & \cdots & e^TA_{1k}z_k \\ \vdots & \ddots & \vdots \\ e^TA_{k1}z_1 & \cdots & e^TA_{kk}z_k \end{pmatrix}_{k\times k}$$

is called the coupling matrix, and it has the following properties.

• $C$ is nonnegative and irreducible.

• $\rho(C) = r$.

• The Perron vector for $C$, called the coupling vector, is given by $\xi = \begin{pmatrix} \xi_1 \\ \xi_2 \\ \vdots \\ \xi_k \end{pmatrix}$,
where $\xi_i = e^Tp_i$ is as defined in (15.4.12).

• The Perron vector for $A$ is given by $p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_k \end{pmatrix} = \begin{pmatrix} \xi_1 z_1 \\ \xi_2 z_2 \\ \vdots \\ \xi_k z_k \end{pmatrix}$.

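Proof. $C \geq 0$ because each term $c_{ij} = e^TA_{ij}z_j$ is nonnegative. $C$ is irreducible because
$c_{ij} = 0 \implies A_{ij} = 0$ (if $C$ could be permuted to a block triangular form, then so could
$A$). To prove the rest of the theorem, notice that $C = RAL$, where $R$ and $L$ are given by

$$R = \begin{pmatrix} e^T & 0 & \cdots & 0 \\ 0 & e^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^T \end{pmatrix}_{k\times n} \quad \text{and} \quad L = \begin{pmatrix} z_1 & 0 & \cdots & 0 \\ 0 & z_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & z_k \end{pmatrix}_{n\times k}.$$

We know from (15.4.12) that $L\xi = p$ and $Rp = \xi$, so

$$C\xi = RAL\xi = RAp = R(rp) = r\xi.$$

Furthermore, $\xi > 0$ (because $p_i > 0$ for each $i$), and $e^T\xi = e^TRp = e^Tp = 1$. It now
follows that $r = \rho(C)$ and $\xi$ is the Perron vector for $C$. The conclusion that
$p = \begin{pmatrix} \xi_1 z_1 \\ \vdots \\ \xi_k z_k \end{pmatrix}$ comes from (15.4.13). ∎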
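

The coupling theorem can be exercised end to end on a small matrix. The following sketch is an illustration under stated assumptions, not the book's own example: the $3 \times 3$ positive matrix was chosen so that $A(1, 2, 1)^T = 7(1, 2, 1)^T$, which makes $r = 7$ and $p = (1/4, 1/2, 1/4)^T$ known in advance, and the helpers `solve`, `perron_vector`, and `perron_complement` are hypothetical names.

```python
from fractions import Fraction as F

def solve(M, B):
    """Solve M X = B exactly by Gauss-Jordan elimination (M square, nonsingular)."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if aug[i][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for i in range(n):
            if i != c and aug[i][c] != 0:
                f = aug[i][c]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]

def perron_vector(M, r):
    """Right eigenvector of M for the eigenvalue r, scaled so its entries sum to 1.
    One row of (rI - M) z = 0 is replaced by the normalization e^T z = 1."""
    n = len(M)
    rows = [[(r if i == j else 0) - M[i][j] for j in range(n)] for i in range(n - 1)]
    rows.append([F(1)] * n)
    rhs = [[F(0)] for _ in range(n - 1)] + [[F(1)]]
    return [x[0] for x in solve(rows, rhs)]

def perron_complement(A, r, keep):
    """A_ii + A_i* (rI - A_*)^{-1} A_*i, as in (15.4.2)."""
    drop = [i for i in range(len(A)) if i not in keep]
    shifted = [[(r if i == j else 0) - A[i][j] for j in drop] for i in drop]
    X = solve(shifted, [[A[i][j] for j in keep] for i in drop])
    return [[A[i][j] + sum(A[i][d] * X[t][c] for t, d in enumerate(drop))
             for c, j in enumerate(keep)] for i in keep]

# Assumed example: A > 0 with A (1, 2, 1)^T = 7 (1, 2, 1)^T, hence rho(A) = 7.
A = [[F(x) for x in row] for row in [[1, 2, 2], [2, 5, 2], [3, 1, 2]]]
r = 7
clusters = [[0, 1], [2]]          # index sets of the two diagonal blocks
k = len(clusters)

z = [perron_vector(perron_complement(A, r, c), r) for c in clusters]

# c_ij = e^T A_ij z_j
C = [[sum(A[a][b] * z[j][t] for a in clusters[i] for t, b in enumerate(clusters[j]))
      for j in range(k)] for i in range(k)]
xi = perron_vector(C, r)                               # coupling vector
p = [xi[i] * zi for i in range(k) for zi in z[i]]      # glued vector (15.4.13)

print(C, xi, p)
```

Note that the Perron vector of $A$ is recovered here without ever solving an eigenproblem of size $n$; only the smaller complements and the $k \times k$ coupling matrix are used, although forming the complements still requires knowing $r$.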


The matrices $R$ and $L$ in the above proof are special cases of transformations known
respectively as restriction and prolongation operations because when $n > k$, $R$ “restricts”
$n$-tuples down to $k$-tuples while $L$ “prolongates” $k$-tuples back up to $n$-tuples in an
inverse-like manner since $RL = I$. Restriction-prolongation techniques like the one above
are popular tools in applied and numerical work.


To solidify the concepts of Perron complementation, consider the following example.
The matrix

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$

[numerical entries of $A$ not recoverable from the source]

is irreducible with $\rho(A) = 7$, and the two Perron complements are

$$P_1 = A_{11} + A_{12}(7I - A_{22})^{-1}A_{21} \quad \text{with } \rho(P_1) = 7,$$

and

$$P_2 = A_{22} + A_{21}(7I - A_{11})^{-1}A_{12} \quad \text{with } \rho(P_2) = 7.$$

[numerical entries of $P_1$ and $P_2$ not recoverable from the source]

The respective Perron vectors for $P_1$ and $P_2$ are $z_1$ and $z_2$ [entries not recoverable from
the source], and the coupling matrix is

$$C = \begin{pmatrix} e^TA_{11}z_1 & e^TA_{12}z_2 \\ e^TA_{21}z_1 & e^TA_{22}z_2 \end{pmatrix} = \begin{pmatrix} 4 & 3 \\ 3 & 4 \end{pmatrix} \quad \text{with } \rho(C) = 7.$$

The coupling vector (the Perron vector of $C$) is $\xi = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}$, so the Perron vector of $A$ is

$$p = \begin{pmatrix} \xi_1 z_1 \\ \xi_2 z_2 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} z_1 \\ z_2 \end{pmatrix}.$$


Knowledge of $\rho(A)$ is required to form the Perron complements of $A$, and this can
be a bottleneck in some situations. However, there are important applications in which
the spectral radius is known in advance. A notable example is the theory of finite Markov
chains as described in section 15.3 because $\rho(P) = 1$ for all transition probability matrices
$P$. The next section is devoted to showing how Perron complementation is applied in the
theory of Markov chains.


15.5 STOCHASTIC COMPLEMENTATION 


When the concept of Perron complementation is applied to irreducible stochastic matrices, 
some useful aspects of Markov chains are produced. In particular, the Perron complementation
idea applied to Markov chains results in a technique for reducing a chain with a large
number of states to a smaller chain without losing important characteristics.

Consider an $n$-state irreducible Markov chain, and let

$$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1k} \\ P_{21} & P_{22} & \cdots & P_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ P_{k1} & P_{k2} & \cdots & P_{kk} \end{pmatrix} \quad (\text{with square diagonal blocks}) \qquad (15.5.1)$$

be a partition of the associated transition probability matrix. We know that $P$ is an irreducible
stochastic matrix with $\rho(P) = 1$ (p. 177), so the associated Perron complements
are given by

$$S_i = P_{ii} + P_{i*}(I - P_i^*)^{-1}P_{*i}.$$

As we will see, these complements $S_i$ have additional stochastic properties, so they are
alternately referred to as stochastic complements in the context of Markov chains. Properties
(15.4.7)-(15.4.9) on page 189 guarantee that each $S_i$ is also a nonnegative irreducible
matrix with $\rho(S_i) = 1$. Furthermore,

$$\begin{aligned} Pe = e &\implies P_{ii}e + P_{i*}e = e \ \text{ and } \ P_{*i}e + P_i^*e = e \\ &\implies P_{ii}e + P_{i*}e = e \ \text{ and } \ e = (I - P_i^*)^{-1}P_{*i}e \\ &\implies S_ie = e. \end{aligned}$$


In other words, every stochastic complement S; is itself the transition probability matrix 
of some smaller Markov chain. 


To understand the relationship between the smaller chain defined by $S_i$ and the parent
chain associated with $P$, consider the simpler (but equivalent) situation where the set
of states $\{1, 2, \ldots, n\}$ is partitioned into two clusters,

$$\mathcal{S}_1 = \{1, 2, \ldots, r\} \quad \text{and} \quad \mathcal{S}_2 = \{r+1, r+2, \ldots, n\},$$

so that

$$P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix} \quad \text{and} \quad \begin{aligned} S_1 &= P_{11} + P_{12}(I - P_{22})^{-1}P_{21}, \\ S_2 &= P_{22} + P_{21}(I - P_{11})^{-1}P_{12}, \end{aligned} \qquad (15.5.2)$$

where the first block row and column of $P$ correspond to states $1, \ldots, r$ and the second to
states $r+1, \ldots, n$. Focus on one of these complements, say the second one, and interpret the $(i, j)$-entry
$[S_2]_{ij} = [P_{22}]_{ij} + [P_{21}(I - P_{11})^{-1}P_{12}]_{ij}$. Notice that $[P_{22}]_{ij}$ is simply the probability
of moving from state $r + i \in \mathcal{S}_2$ to state $r + j \in \mathcal{S}_2$ in one step, while

$$[P_{21}(I - P_{11})^{-1}P_{12}]_{ij} = \sum_{k=1}^{r} [P_{21}]_{ik}[(I - P_{11})^{-1}P_{12}]_{kj}.$$

The term $[P_{21}]_{ik}$ is the probability of moving from $r + i \in \mathcal{S}_2$ to $k \in \mathcal{S}_1$ in one step, while
$[(I - P_{11})^{-1}P_{12}]_{kj}$ is the probability of hitting state $r + j \in \mathcal{S}_2$ the first time the chain
enters $\mathcal{S}_2$ when the process starts from $k \in \mathcal{S}_1$. This can be seen by considering the states in
$\mathcal{S}_2$ to be absorbing so as to artificially force the process to stop as soon as the chain enters
$\mathcal{S}_2$. It follows from the results on absorbing chains (p. 186) that $[(I - P_{11})^{-1}P_{12}]_{kj}$ is the
probability of entering $\mathcal{S}_2$ at state $r + j$ when the chain starts in $k \in \mathcal{S}_1$. Consequently,
$[P_{21}]_{ik}[(I - P_{11})^{-1}P_{12}]_{kj}$ is the probability of moving directly from $r + i \in \mathcal{S}_2$ to $k \in \mathcal{S}_1$
and then, perhaps after several steps inside of $\mathcal{S}_1$, reentering $\mathcal{S}_2$ at state $r + j$ (without
regard to what happened while the process was in $\mathcal{S}_1$). Therefore,

$$[S_2]_{ij} = [P_{22}]_{ij} + \sum_{k=1}^{r} [P_{21}]_{ik}[(I - P_{11})^{-1}P_{12}]_{kj}$$

is the probability of moving from $r + i \in \mathcal{S}_2$ to $r + j \in \mathcal{S}_2$ in a single step or else by
moving directly from $r + i \in \mathcal{S}_2$ to somewhere inside of $\mathcal{S}_1$ (perhaps staying there for
awhile) and then hitting state $r + j$ upon first reentry into $\mathcal{S}_2$. In other words, $S_2$ is the
transition probability matrix for a chain that records the location of the process only when
the process is visiting states in $\mathcal{S}_2$, and visits to states in $\mathcal{S}_1$ are simply ignored or censored
out.


15.6 CENSORING 


Censored Markov Chains 


For an $n$-state irreducible Markov chain with transition probability matrix $P$ that
is partitioned as in (15.5.1), let $\mathcal{S}$ denote the collection of states that correspond
to the row (or column) indices of the $i$th diagonal block $P_{ii}$, and let $\overline{\mathcal{S}}$ denote
the complementary set of states. The censored Markov chain associated with $\mathcal{S}$
is defined to be the Markov chain that records the location of the parent chain
(defined by $P$) only when the parent chain visits states in $\mathcal{S}$. Visits to states in
$\overline{\mathcal{S}}$ are ignored. The transition probability matrix for this censored chain is the
stochastic complement

$$S_i = P_{ii} + P_{i*}(I - P_i^*)^{-1}P_{*i}. \qquad (15.6.1)$$
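

The definition can be illustrated by simulation: run the parent chain, record only visits to the chosen cluster, and compare the observed transition frequencies with the stochastic complement. This is a rough sketch under assumptions; the three-state matrix `P`, the choice of censoring the single state 2, the seed, and the run length are all illustrative, and the agreement is only statistical.

```python
import random

# Assumed three-state transition matrix (positive, hence irreducible).
P = [[1/2, 1/4, 1/4],
     [1/3, 1/3, 1/3],
     [1/4, 1/2, 1/4]]
kept = [0, 1]     # cluster whose visits are recorded
censored = 2      # the single state whose visits are ignored

# Stochastic complement S = P11 + P12 (1 - p22)^{-1} P21; the inverse is a
# scalar here because only one state is censored.
S = [[P[i][j] + P[i][censored] * P[censored][j] / (1 - P[censored][censored])
      for j in kept] for i in kept]

def step(state, rng):
    """Draw the next state of the parent chain from row `state` of P."""
    u, acc = rng.random(), 0.0
    for nxt, prob in enumerate(P[state]):
        acc += prob
        if u < acc:
            return nxt
    return len(P) - 1   # guard against round-off in the cumulative sum

rng = random.Random(0)
counts = [[0, 0], [0, 0]]
state = last_seen = 0             # start in state 0, which belongs to the cluster
for _ in range(200_000):
    state = step(state, rng)
    if state in kept:             # record only visits to the cluster
        counts[last_seen][state] += 1
        last_seen = state

estimate = [[c / sum(row) for c in row] for row in counts]
print(S)
print(estimate)
```

With this many steps the empirical rows should agree with the rows of `S` to roughly two decimal places; the exact figures depend on the random stream.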


Property (15.4.8) guarantees that every stochastic complement $S_i$ is an irreducible
matrix, so every censored chain is an irreducible Markov chain. Consequently, each censored
chain has an associated stationary probability distribution, $s_i^T$, such that

$$s_i^TS_i = s_i^T, \quad s_i^T > 0, \quad s_i^Te = 1 \quad (\text{as summarized on p. 181}).$$

In the language of matrix theory $s_i^T$ is the left-hand Perron vector for $S_i$, but in the context
of Markov chains $s_i^T$ is called a censored probability distribution.


To interpret the meaning of a censored distribution, suppose that the state space for
an $n$-state Markov chain is partitioned into clusters as

$$\{1, 2, \ldots, n\} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \cdots \cup \mathcal{S}_k, \quad \text{where } \mathcal{S}_i = \{\sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{in_i}\}, \qquad (15.6.2)$$

and partition the $t$th step distribution and the stationary distribution in accord with (15.6.2)
as

$$p^T(t) = \big(p_1^T(t) \mid p_2^T(t) \mid \cdots \mid p_k^T(t)\big) \quad \text{and} \quad \pi^T = \big(\pi_1^T \mid \pi_2^T \mid \cdots \mid \pi_k^T\big). \qquad (15.6.3)$$

To ensure that limits exist assume the chain is primitive (page 181). Let $X_t$ be the state of
the chain after the $t$th step, and let $Y_t$ be the cluster that contains $X_t$ after the $t$th step. The
probability of being in state $\sigma_{ij}$ (the $j$th state of the $i$th cluster) after $t$ steps is

$$P(X_t = \sigma_{ij}) = [p_i^T(t)]_j \quad (\text{the } j\text{th component of } p_i^T(t)),$$

and the limiting probability of being in $\sigma_{ij}$ is

$$\lim_{t\to\infty} P(X_t = \sigma_{ij}) = \lim_{t\to\infty} [p_i^T(t)]_j = [\pi_i^T]_j \quad (\text{the } j\text{th component of } \pi_i^T).$$

Similarly, the probability of being inside cluster $\mathcal{S}_i$ after $t$ steps is

$$P(Y_t = i) = p_i^T(t)e,$$

and the limiting probability of being somewhere in $\mathcal{S}_i$ is

$$\lim_{t\to\infty} P(Y_t = i) = \lim_{t\to\infty} p_i^T(t)e = \pi_i^Te. \qquad (15.6.4)$$

Since $\pi^T$ is the left-hand Perron vector for the transition probability matrix $P$, it follows from
the left-hand interpretation of (15.4.12) that the $j$th component of the $i$th censored distribution
$s_i^T$ with respect to the partition (15.6.2) is

$$[s_i^T]_j = \frac{[\pi_i^T]_j}{\pi_i^Te} = \lim_{t\to\infty} \frac{[p_i^T(t)]_j}{p_i^T(t)e} = \lim_{t\to\infty} \frac{P(X_t = \sigma_{ij})}{P(Y_t = i)} = \lim_{t\to\infty} P(X_t = \sigma_{ij} \mid Y_t = i).$$

In other words, $[s_i^T]_j$ is the limiting conditional probability of being in $\sigma_{ij}$ given that the
process is somewhere in $\mathcal{S}_i$. Below is a summary.


Censored Probability Distributions 


Consider an $n$-state irreducible Markov chain whose transition probability matrix
$P$, stationary distribution $\pi^T = (\pi_1^T \mid \pi_2^T \mid \cdots \mid \pi_k^T)$, and state space are partitioned
according to

$$\{1, 2, \ldots, n\} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \cdots \cup \mathcal{S}_k, \quad \text{where } \mathcal{S}_i = \{\sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{in_i}\}.$$

The censored probability distributions are the stationary distributions $s_i^T$
of the censored Markov chains defined by the stochastic complements $S_i$ given in
(15.6.1) so that $s_i^TS_i = s_i^T$, where $s_i^T > 0$ and $s_i^Te = 1$. Censored distributions
have the following additional properties.

• $s_i^T = \pi_i^T/\pi_i^Te$ for each $i = 1, 2, \ldots, k$. (15.6.5)

• If $P$ is primitive, then the $j$th component of $s_i^T$ is the limiting conditional probability
of being in the $j$th state of cluster $\mathcal{S}_i$ given that the process is somewhere
in $\mathcal{S}_i$. In other words,

$$[s_i^T]_j = \lim_{t\to\infty} P(X_t = \sigma_{ij} \mid Y_t = i),$$

where $X_t$ and $Y_t$ are the respective state and cluster number of the chain after
the $t$th step.


15.7 AGGREGATION 


Now specialize the coupling theorem for Perron complements given on page 191 to Markov
chains. Vectors are on the left-hand side of matrices for Markov chain applications, so, for
the partition of $P$ in (15.5.1) that corresponds to the partition of the state space in (15.6.2),
the coupling matrix on page 191 takes the form

$$A = \begin{pmatrix} s_1^TP_{11}e & \cdots & s_1^TP_{1k}e \\ \vdots & \ddots & \vdots \\ s_k^TP_{k1}e & \cdots & s_k^TP_{kk}e \end{pmatrix} = \begin{pmatrix} s_1^T & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & s_k^T \end{pmatrix} \begin{pmatrix} P_{11} & \cdots & P_{1k} \\ \vdots & \ddots & \vdots \\ P_{k1} & \cdots & P_{kk} \end{pmatrix} \begin{pmatrix} e & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & e \end{pmatrix} = L_{k\times n}P_{n\times n}R_{n\times k}, \qquad (15.7.1)$$

where the $s_i^T$’s in $L$ are the censored distributions, and the $e$’s in $R$ are columns of 1’s
of appropriate size. (We switched the notation for the coupling matrix from $C$ to $A$ for
reasons that soon will be apparent.) Remarkably, $A$ also defines an irreducible Markov
chain, but this chain has only $k$ states. The nonnegativity and irreducibility of $A$ are
guaranteed by the coupling theorem on page 191, and $A$ is stochastic because

$$Ae = LPRe = LPe = Le = e.$$

To understand the nature of the chain defined by $A$ along with its stationary distribution
$\alpha^T$, let’s interpret the individual entries $a_{ij} = s_i^TP_{ij}e$ in $A$ as probabilities. As before, let
$X_t$ and $Y_t$ be the respective state and cluster number of the chain after the $t$th step, and let
$\wedge$ and $\vee$ denote AND and OR, respectively.


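Given that the process is in cluster $\mathcal{S}_i$ after $t$ steps, consider the probability of
moving to cluster $\mathcal{S}_j$ on the next step. In other words, consider

$$P(Y_{t+1} = j \mid Y_t = i) = \frac{P(Y_t = i \wedge Y_{t+1} = j)}{P(Y_t = i)}. \qquad (15.7.2)$$

To determine this conditional probability, suppose that

$$P = \begin{pmatrix} P_{11} & \cdots & P_{1k} \\ \vdots & \ddots & \vdots \\ P_{k1} & \cdots & P_{kk} \end{pmatrix}, \quad p^T(t) = \big(p_1^T(t) \mid \cdots \mid p_k^T(t)\big), \quad \pi^T = \big(\pi_1^T \mid \cdots \mid \pi_k^T\big)$$

are partitioned in accord with (15.6.2), and compute the numerator in (15.7.2) as

$$\begin{aligned} P(Y_t = i \wedge Y_{t+1} = j) &= P\big([X_t = \sigma_{i1} \vee \cdots \vee X_t = \sigma_{in_i}] \wedge [X_{t+1} = \sigma_{j1} \vee \cdots \vee X_{t+1} = \sigma_{jn_j}]\big) \\ &= P\big([X_t = \sigma_{i1} \wedge X_{t+1} = \sigma_{j1}] \vee \cdots \vee [X_t = \sigma_{in_i} \wedge X_{t+1} = \sigma_{jn_j}]\big) \\ &= \sum_{g=1}^{n_i} \sum_{h=1}^{n_j} P(X_t = \sigma_{ig} \wedge X_{t+1} = \sigma_{jh}) \\ &= \sum_{g=1}^{n_i} \sum_{h=1}^{n_j} P(X_t = \sigma_{ig})\,P(X_{t+1} = \sigma_{jh} \mid X_t = \sigma_{ig}) \\ &= \sum_{h=1}^{n_j} \sum_{g=1}^{n_i} [p_i^T(t)]_g [P_{ij}]_{gh} = \sum_{h=1}^{n_j} [p_i^T(t)P_{ij}]_h \\ &= p_i^T(t)P_{ij}e. \end{aligned}$$

The denominator in (15.7.2) is $P(Y_t = i) = p_i^T(t)e$, and thus

$$P(Y_{t+1} = j \mid Y_t = i) = \frac{p_i^T(t)P_{ij}e}{p_i^T(t)e}. \qquad (15.7.3)$$

It follows from (15.6.5) on page 195 that³

$$s_i^T = \frac{\pi_i^T}{\pi_i^Te} = \lim_{t\to\infty} \frac{p_i^T(t)}{p_i^T(t)e},$$

and therefore, by (15.7.3), the entries in $A$ are given by

$$a_{ij} = s_i^TP_{ij}e = \lim_{t\to\infty} \frac{p_i^T(t)P_{ij}e}{p_i^T(t)e} = \lim_{t\to\infty} P(Y_{t+1} = j \mid Y_t = i). \qquad (15.7.4)$$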


An irreducible chain is said to be in equilibrium at time (step) $t$ if the process is
at steady state in the sense that $p^T(t) = \pi^T$. Consequently, (15.7.4) means that $a_{ij}$ is the
transition probability of moving from cluster $\mathcal{S}_i$ to cluster $\mathcal{S}_j$ after the process has achieved
equilibrium. Below is a summary of these observations.


Aggregation Theorem for Markov Chains 
An irreducible Markov chain whose states are partitioned into $k$ clusters

$$\{1, 2, \ldots, n\} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \cdots \cup \mathcal{S}_k$$

can be compressed into a smaller $k$-state aggregated chain whose states are the
individual clusters $\mathcal{S}_i$.

• The transition probability matrix $A$ of the aggregated chain is the coupling matrix
described on page 191. That is,

$$A = \begin{pmatrix} s_1^TP_{11}e & \cdots & s_1^TP_{1k}e \\ \vdots & \ddots & \vdots \\ s_k^TP_{k1}e & \cdots & s_k^TP_{kk}e \end{pmatrix}_{k\times k}, \qquad (15.7.5)$$

where $P_{ij}$ is the $(i, j)$ block in the partitioned transition matrix $P$ of the unaggregated
chain, and $s_i^T$ is the censored distribution of the $i$th stochastic complement
derived from $P$.

• If $Y_t$ is the cluster that the unaggregated chain occupies after $t$ steps, then, for
primitive chains, the aggregated transition probability $a_{ij} = s_i^TP_{ij}e$ can be
expressed as

$$a_{ij} = \lim_{t\to\infty} P(Y_{t+1} = j \mid Y_t = i).$$

In other words, transitions between states in the aggregated chain correspond to
transitions between clusters in the unaggregated chain when the unaggregated
chain is in equilibrium.
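

A quick floating-point experiment makes the aggregation statement concrete. In this sketch, which rests on assumptions not found in the text, the three-state matrix `P` and the clustering are illustrative, the censored distributions are obtained from the stationary vector through (15.6.5) rather than from stochastic complements, and plain repeated multiplication stands in for a proper eigensolver.

```python
def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def stationary(M, iters=500):
    """Approximate the stationary distribution of a primitive chain by iteration."""
    v = [1.0 / len(M)] * len(M)
    for _ in range(iters):
        v = vec_mat(v, M)
    return v

# Assumed primitive three-state chain and a two-cluster partition of its states.
P = [[1/2, 1/4, 1/4],
     [1/3, 1/3, 1/3],
     [1/4, 1/2, 1/4]]
clusters = [[0, 1], [2]]
k = len(clusters)

pi = stationary(P)
masses = [sum(pi[i] for i in c) for c in clusters]                       # pi_i^T e
s = [[pi[i] / masses[a] for i in c] for a, c in enumerate(clusters)]     # (15.6.5)

# Aggregated transition matrix, a_ij = s_i^T P_ij e, as in (15.7.5).
A = [[sum(s[a][g] * P[i][j] for g, i in enumerate(clusters[a]) for j in clusters[b])
      for b in range(k)] for a in range(k)]
alpha = stationary(A)

# Finite-t conditional probabilities from (15.7.3), starting in state 0.
p = [1.0, 0.0, 0.0]
for _ in range(60):
    p = vec_mat(p, P)
cond = [[sum(p[i] * P[i][j] for i in clusters[a] for j in clusters[b])
         / sum(p[i] for i in clusters[a])
         for b in range(k)] for a in range(k)]

print(A)
print(alpha, masses)
```

In this example the rows of `A` sum to one, `cond` agrees with `A` after a modest number of steps, and the stationary distribution `alpha` of the small chain matches the cluster masses of the large one.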


As an example of the utility of aggregation in Markov chains consider the problem of
determining the eventual probability that the chain is somewhere inside cluster $\mathcal{S}_i$ (the
individual state in $\mathcal{S}_i$ that the process might eventually occupy is irrelevant) without directly
computing the stationary probabilities for the chain. In other words, if $Y_t$ is the cluster in
which the process resides after $t$ steps, the problem is to determine $\lim_{t\to\infty} P(Y_t = i)$.
Of course, there is no problem if the stationary probabilities are known because, as
pointed out in (15.6.4), if $p^T(t) = (p_1^T(t) \mid \cdots \mid p_k^T(t))$ and $\pi^T = (\pi_1^T \mid \cdots \mid \pi_k^T)$, then the
limiting probability of being somewhere in $\mathcal{S}_i$ is

$$\lim_{t\to\infty} P(Y_t = i) = \lim_{t\to\infty} p_i^T(t)e = \pi_i^Te. \qquad (15.7.6)$$

But computing all of $\pi^T$ in a large chain just to find $\pi_i^Te$ can be wasted effort. Since
transitions in the aggregated chain correspond to transitions between the clusters $\mathcal{S}_i$ in the
unaggregated chain at equilibrium, we expect the $i$th component of the stationary distribution
for the aggregated chain to be the limiting probability of being in $\mathcal{S}_i$, and this is true.


• In other words, if the stationary distribution of the aggregated chain defined by $A$ in
(15.7.5) is $\alpha^T = (\alpha_1, \alpha_2, \ldots, \alpha_k)$, then $\alpha_i = \pi_i^Te = \lim_{t\to\infty} P(Y_t = i)$.


³ For these limits to exist, $P$ must be assumed to be primitive.


15.8 DISAGGREGATION 


When interpreted in the context of Markov chains, the coupling theorem on page 191 
represents an expansion or disaggregation process. Below is the formal statement of the 
disaggregation theorem. 


Disaggregation in Markov Chains 


Consider an irreducible Markov chain along with the associated aggregated chain
for which the respective transition probability matrices are

$$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1k} \\ P_{21} & P_{22} & \cdots & P_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ P_{k1} & P_{k2} & \cdots & P_{kk} \end{pmatrix} \quad \text{and} \quad A = \begin{pmatrix} s_1^TP_{11}e & \cdots & s_1^TP_{1k}e \\ \vdots & \ddots & \vdots \\ s_k^TP_{k1}e & \cdots & s_k^TP_{kk}e \end{pmatrix}_{k\times k},$$

where $s_i^T$ is the $i$th censored distribution (the stationary distribution of the $i$th
stochastic complement $S_i = P_{ii} + P_{i*}(I - P_i^*)^{-1}P_{*i}$). If $\alpha^T = (\alpha_1, \alpha_2, \ldots, \alpha_k)$
is the stationary distribution of the aggregated chain defined by $A$, then the stationary
distribution for the unaggregated chain defined by $P$ is

$$\pi^T = \big(\alpha_1 s_1^T \mid \alpha_2 s_2^T \mid \cdots \mid \alpha_k s_k^T\big).$$

In other words, the censored distributions $s_i^T$ can be pasted together to form the
global distribution $\pi^T$, and the $\alpha_i$’s provide the glue to do the job.
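

Read as an algorithm, the theorem says: form each stochastic complement, find its stationary distribution, build the small aggregated matrix, and paste the pieces together. The sketch below follows those steps in exact arithmetic; it is a straightforward illustration rather than an efficient method, and the example matrix together with the helper names `solve`, `stationary`, and `stochastic_complement` are assumptions.

```python
from fractions import Fraction as F

def solve(M, B):
    """Solve M X = B exactly by Gauss-Jordan elimination (M square, nonsingular)."""
    n = len(M)
    aug = [list(M[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if aug[i][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [x / pivot for x in aug[c]]
        for i in range(n):
            if i != c and aug[i][c] != 0:
                f = aug[i][c]
                aug[i] = [x - f * y for x, y in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]

def stationary(M):
    """Left-hand vector x with x^T M = x^T and x^T e = 1 (M irreducible, stochastic)."""
    n = len(M)
    rows = [[(1 if i == j else 0) - M[j][i] for j in range(n)] for i in range(n - 1)]
    rows.append([F(1)] * n)                       # normalization replaces one equation
    rhs = [[F(0)] for _ in range(n - 1)] + [[F(1)]]
    return [x[0] for x in solve(rows, rhs)]

def stochastic_complement(P, keep):
    """S_i = P_ii + P_i* (I - P_*)^{-1} P_*i, as in (15.6.1)."""
    drop = [i for i in range(len(P)) if i not in keep]
    I_minus = [[(1 if i == j else 0) - P[i][j] for j in drop] for i in drop]
    X = solve(I_minus, [[P[i][j] for j in keep] for i in drop])
    return [[P[i][j] + sum(P[i][d] * X[t][c] for t, d in enumerate(drop))
             for c, j in enumerate(keep)] for i in keep]

# Assumed three-state irreducible chain split into two clusters.
P = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(1, 4), F(1, 2), F(1, 4)]]
clusters = [[0, 1], [2]]
k = len(clusters)

s = [stationary(stochastic_complement(P, c)) for c in clusters]   # censored distributions
A = [[sum(s[a][g] * P[i][j] for g, i in enumerate(clusters[a]) for j in clusters[b])
      for b in range(k)] for a in range(k)]                       # aggregated chain
alpha = stationary(A)                                             # the "glue"
pi = [alpha[a] * x for a in range(k) for x in s[a]]               # disaggregation

print(pi)
```

Every step here costs about as much as solving the original problem directly, which is consistent with the remark that the procedure pays off only when special structure in the chain can be exploited.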


It’s clear that disaggregation as stated above can serve as an algorithm for computing
the stationary probabilities of any irreducible chain. But while the aggregation-disaggregation
theorems are theoretically elegant, their straightforward implementation
usually doesn’t result in a computational advantage over more standard methods.
Computing the stochastic complements $S_i$ in order to determine the censored distributions
$s_i^T$ is generally a computationally intensive task, so as far as computation is concerned, the
goal is to somehow exploit special structure exhibited by the chain to judiciously implement
the aggregation/disaggregation procedure.


When someone is seeking, it happens quite easily that he sees only the thing that
he is seeking; that he is unable to find anything, unable to absorb anything, 
because he is only thinking of the thing he is seeking, because 

he has a goal, because he is obsessed with his goal. 


Seeking means: to have a goal; but finding means: to be free, to be receptive, to 
have no goal. You, O worthy one, are perhaps indeed a seeker, for in striving 
towards your goal, you do not see many things that are under your nose. 


— Siddhartha speaking to his friend Govinda 
in Hermann Hesse’s Siddhartha [95]


Chapter Sixteen


Glossary 


anchor text text used in the hyperlink when linking from one webpage to another 
arc a link between two nodes in a graph 


authority a webpage with many inlinks; a good authority has inlinks from pages with 
high hub scores 


authority matrix the matrix $L^TL$ created in the HITS method; its dominant right-hand
eigenvector is the authority vector, which is used to give a ranking of webpages by 
their authoritativeness 


authority score the numerical score assigned to a webpage that gives a measure of that 
page’s authoritativeness 


authority vector a vector that gives the authoritativeness of webpages; the $i$th component
is the authority score for page $i$


blog a webpage that represents an online diary on a particular topic, which typically has 
postings sorted by time and many hyperlinks but little textual content 


Boolean model a classic model in traditional information retrieval that uses the Boolean 
operators AND, OR, and NOT to answer queries 


co-citation a term in bibliometrics that is used when two papers are cited by the same 
paper; on the Web it is used when two webpages have inlinks from the same page 


content index the part of a search engine devoted to storing information about the content 
of a webpage 


content score the information retrieval score assigned to each page; it is computed from 
traditional information retrieval factors such as similarity of the page to the query, 
use of query terms in the title, and the number of times the query terms are used in 
the page. 


co-reference a term in bibliometrics that is used when two papers cite the same paper; on 
the Web it is used when two webpages have outlinks to the same page 


crawler the part of the search engine that sends spiders to travel the Web, gathering new
and updated webpages for the engine’s indexes


cycle a path in the Web’s graph that always returns back to its origin, e.g., a trivial cycle
occurs when page A points only to page B and page B points only back to A; the 
random surfer of the PageRank model can get stuck in a cycle and circle indefinitely 
in the pages on the path, which causes convergence problems for PageRank 


dangling node a webpage with no outlinks, which creates a $0^T$ row in the PageRank
matrix; causes a problem for the random surfer of the PageRank model because the 
random surfer is trapped whenever he enters a dangling node 


dangling node vector the vector $a$ whose $i$th element is 1 if page $i$ is a dangling node and
0 otherwise; used to help give the random surfer an outlet whenever he reaches a dangling
node


fundamental matrix the matrix $(I - \alpha S)^{-1}$ that appears in many PageRank computations


Google bomb a way to spam Google by using the anchor text of a hyperlink to boost the 
rank of a target page; bomb detonates whenever a query on the terms in the anchor 
text is submitted and enough pages have the appropriate anchor text for a hyperlink 
pointing to the target page; invented by Adam Mathes in 2001 as a prank against his 
friend Andy Pressman 


Google dance the shuffling of pages in the ranked list that occurs during the monthly (it’s 
speculated) updating of PageRank 


Google matrix the matrix used to determine the PageRank importance scores for
webpages; its dominant left-hand eigenvector is the PageRank vector; the Google matrix
is given by $\mathbf{G} = \alpha\mathbf{S} + (1-\alpha)\mathbf{e}\mathbf{v}^T$
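
The construction of the Google matrix and the computation of its dominant left-hand eigenvector can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `google_matrix` and `pagerank`, the value 0.85 for the scaling parameter, the uniform personalization vector, the uniform replacement of dangling rows, and the three-page example are not taken from the text.

```python
def google_matrix(H, alpha=0.85, v=None):
    """Return G = alpha*S + (1 - alpha)*e*v^T as a list of rows.

    H is a row-normalized hyperlink matrix (list of lists). The default
    alpha and the uniform default for v are illustrative assumptions.
    """
    n = len(H)
    if v is None:
        v = [1.0 / n] * n
    uniform = [1.0 / n] * n
    G = []
    for row in H:
        # Stochasticity adjustment (assumed uniform): a dangling, all-zero
        # row is replaced by the uniform probability vector.
        s_row = row if sum(row) > 0 else uniform
        G.append([alpha * s_row[j] + (1 - alpha) * v[j] for j in range(n)])
    return G


def pagerank(G, tol=1e-10, max_iter=1000):
    """Power method for the dominant left-hand eigenvector: pi^T <- pi^T G."""
    n = len(G)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new_pi = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


# Hypothetical 3-page web: page 0 links to pages 1 and 2, page 1 links to
# page 2, and page 2 is a dangling node.
H = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
G = google_matrix(H)
pi = pagerank(G)
print(pi)
```

In this small example every row of `G` sums to 1, and page 2, which receives links from both other pages, ends up with the largest score.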


Googleopoly Google’s dominance of the search market 


HITS link analysis model that defines webpages as hubs and authorities and uses the graph 
structure of the web to rank webpages; developed by Jon Kleinberg in 1998; used by 
the Teoma search engine; acronym for Hypertext Induced Topic Search 


hub a webpage with many outlinks; a good hub has outlinks to pages with high authority 
scores 


hub matrix the matrix $\mathbf{L}\mathbf{L}^T$ created in the HITS method; its dominant right-hand
eigenvector is the hub vector, which is used to give a ranking of webpages by their quality
as portal pages
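
A minimal sketch of how the hub vector might be computed: alternating the authority and hub updates is equivalent to applying the power method to the hub matrix (L times its transpose). The function name `hits`, the fixed iteration count, the sum-to-one normalization, and the three-page adjacency matrix are illustrative assumptions, not details from the text.

```python
def hits(L, iters=100):
    """Alternate the HITS updates a = L^T h and h = L a on adjacency matrix L.

    One full pass multiplies h by L L^T, so the loop is a power iteration
    on the hub matrix. Assumes the graph has at least one link.
    """
    n = len(L)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority update: page j collects the hub scores of pages linking to it.
        auth = [sum(L[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub update: page i collects the authority scores of pages it links to.
        hub = [sum(L[i][j] * auth[j] for j in range(n)) for i in range(n)]
        # Normalize so each vector sums to 1 (one of several common choices).
        total_auth, total_hub = sum(auth), sum(hub)
        auth = [x / total_auth for x in auth]
        hub = [x / total_hub for x in hub]
    return hub, auth


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
L = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
hub, auth = hits(L)
print(hub, auth)
```

Here page 0 is the best hub because it has the most outlinks, and page 2 is the best authority because both other pages point to it.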


hub score the numerical score assigned to a webpage that gives a measure of that page’s 
“hubbiness”, which is a measure of the page’s quality as a portal page 


hub vector a vector that gives the “hubbiness” of webpages; the $i$th component is the hub
score for page $i$


hyperlink a link in a webpage that allows a reader to automatically jump to another page; 
creates a directed arc in the web graph 


indexer the part of the search engine that compresses a webpage from the crawler into an 
abbreviated Cliff Notes version; pulls off the essential elements of the page such as 
title, description, date, images, and tables


indexes where the search engine stores all its webpage information; often a search engine 
has several different indexes such as an image index, structure index, and inverted 
index 


inlink a link into a webpage


intelligent agent a software robot designed to retrieve specific information automatically


intelligent surfer replaces the random surfer; the intelligent surfer follows the hyperlinks 
on the Web but does not randomly decide which page to visit next; rather he chooses 
the page that best fits his needs and interests 


inverted file index the search engine’s largest index; next to each term in the engine’s 
database there is a list of all pages that use the term; similar to an index in the back 
of a book 
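
The idea of listing, next to each term, the pages that use it can be sketched as follows. This is a toy illustration only: the names `build_inverted_index` and `search`, the whitespace tokenization, and the sample pages are assumptions, and a real engine's index stores far more information per term.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each term to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index, query):
    """Return the pages that contain every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical page collection keyed by page id.
pages = {
    1: "linear algebra and web search",
    2: "web search engines",
    3: "linear algebra",
}
index = build_inverted_index(pages)
print(search(index, "web search"))
```

Looking up a term is a single dictionary access, which is why this structure suits query processing better than scanning every page.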


Jon Kleinberg Cornell University computer science professor; inventor of the HITS al- 
gorithm 


link analysis using the hyperlink structure of the Web to improve search engine rankings 


link farm a link spamming technique for boosting a page’s rank; a set of webpages that 
are densely connected 


link spamming a type of spamming that uses the Web’s hyperlinks to fool search engines 


meta-search engine a search engine that combines the results of several independent 
search engines into one unified list 


metatag hidden tag that is embedded in the HTML source code of a webpage to help 
spiders locate title, description, and keyword information in the page 


modified HITS a modification to the standard HITS method that guarantees the existence 
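and uniqueness of the HITS authority and hub vectors; uses the matrices
$\xi\mathbf{L}^T\mathbf{L} + \frac{1-\xi}{n}\mathbf{e}\mathbf{e}^T$ and $\xi\mathbf{L}\mathbf{L}^T + \frac{1-\xi}{n}\mathbf{e}\mathbf{e}^T$ in place of $\mathbf{L}^T\mathbf{L}$ and $\mathbf{L}\mathbf{L}^T$, the standard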
authority and hub matrices 


neighborhood graph the graph created in the first step of the HITS method; includes all
pages that use the query terms as well as pages that link to and from the relevant
pages


netizen a citizen of the Internet


node a vertex in a graph; webpages are nodes in the web graph


nondangling node a webpage with at least one outlink; any page that is not dangling


outlink a link from a webpage


overall score the final query-dependent relevancy score given to a page; a combination of 
the popularity score and the content score 


PageRank link analysis model that uses an enormous Markov chain to rank webpages by 
importance; invented by Brin and Page in 1998; now part of Google


page repository where new or updated pages are temporarily stored after they are re- 
trieved by the crawler module and before they are sent to the indexer 


personalization vector the probability vector $\mathbf{v}^T > 0$ in the PageRank model; used to
fix the problems of rank sinks and cycles faced by the random surfer; can be used
to create personalized PageRank vectors that are biased toward a particular user’s 
interests 


polysemy occurs when a word has multiple meanings, e.g., bank 


popularity score the score given to each webpage that measures the relative importance 
or popularity of that page; created from the Web’s hyperlink structure 


precision a measure of the quality of search results, specifically the ratio of the number of
relevant documents retrieved to the total number of retrieved documents
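
Precision and its companion measure, recall (defined later in this glossary), can both be computed from the set of retrieved documents and the set of relevant documents. The sketch below assumes documents are identified by hashable ids; the function name `precision_recall` and the convention of returning 0.0 for an empty set are illustrative choices.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids.

    precision = relevant retrieved / total retrieved
    recall    = relevant retrieved / total relevant in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = retrieved & relevant
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents retrieved, three relevant in the collection.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(precision, recall)
```

In this example two of the four retrieved documents are relevant, so precision is 2/4, and two of the three relevant documents were found, so recall is 2/3.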


primitivity fix the adjustment to the PageRank model that artificially adds direct (al- 
though small in weight) connections between every page on the Web; guarantees 
the existence and uniqueness of the PageRank vector and the convergence of the 
power method to that vector 


probabilistic model a traditional information retrieval model that uses probability and 
odds ratios to identify the relevance of documents to the query 


PR0 an abbreviation for the lowest PageRank score; it’s believed that sometimes the pages
of spammers are set to PR0 by Google


pure link a page returned in the search results whose ranking is pure in the sense that the 
page’s owner did not pay the search engine for an improved ranking 


query information request that is sent to a search engine 


query-dependent a measure or model that depends on the query and is computed for each 
individual query 


query-independent any measure or model that is computed regardless of the query; one 
measure that holds for all queries 


query processing the part of the search engine that transforms the user’s query into num- 
bers that the system can handle 


random surfer a surfer who follows the hyperlink structure of the Web indefinitely by 
choosing the next page to visit at random from among the outlinking pages of the 
current page; a convenient way of describing the PageRank model 


rank sink a webpage or set of webpages that continue to suck in PageRank during the 
iterative PageRank computation; once the random surfer enters this set of pages, 
there is no escape route 


real-time an adjective for a process that responds in a short and predictable time frame; 
the time frame is usually measured in vague units such as a user’s patience threshold 


recall a measure of the quality of search results, specifically the ratio of number of relevant 
documents retrieved to the total number of relevant documents in the collection


relevance feedback a refining and tuning technique used by many information retrieval 
systems; a user selects a subset of retrieved documents that are deemed useful, and 
with this additional information, a revised set of generally more useful documents is 
retrieved 


relevance scoring a numerical score of a document’s relevance to the query that is pro- 
vided by most information retrieval systems 


SALSA link analysis model that combines properties of PageRank and HITS to rank web- 
pages as hubs and authorities; developed by Ronny Lempel and Shlomo Moran in 
2000; acronym for Stochastic Approach to Link Structure Analysis 


search engine optimization the process of changing a webpage to optimize its potential 
for high rankings by search engines; includes both ethical and unethical means of 
boosting rank 


Sergey Brin and Larry Page former Ph.D. candidates at Stanford University who devel- 
oped the PageRank system for ranking webpages by importance; cofounders and 
co-owners of Google 


spam any act meant to intentionally deceive a search engine; spam includes using white 
text on a white background, link spamming, cloaking, misleading meta-tag descrip- 
tions, and Google bombing 


special-purpose index the part of a search engine that is devoted to storing special pur- 
pose information such as images, PDF files, etc. 


spider part of a search engine’s crawler module that crawls the Web in search of new and 
updated pages 


sponsored link a page returned in the search results whose owner has paid the search 
engine company for an improved ranking 


stochasticity adjustment the adjustment to the original PageRank model that artificially 
forces the PageRank matrix to be stochastic; allows the random surfer to teleport to 
a new page immediately after entering a dangling node 


structure index the part of a search engine that stores information about the link structure 
of the Web 


synonymy occurs when two words have the same meaning, e.g., car and automobile 


teleport a periodic action taken by the random surfer whereby he stops following the 
Web’s hyperlink structure and immediately jumps to a new page at random; also 
occurs immediately after the random surfer enters a dangling node 


traditional information retrieval the field that studies search within nonlinked document 
collections 


TrafficRank link analysis model that uses optimization and entropy to rank webpages by 
their traffic flow; developed by John Tomlin in 2003


vector space model a traditional information retrieval model that thinks of documents as 
vectors in high-dimensional space and uses the angle between vectors to determine 
the similarity of documents to the query 
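
The angle-based comparison can be illustrated with the cosine of the angle between raw term-count vectors. This is a minimal sketch: the name `cosine_similarity`, the whitespace tokenization, and the unweighted term counts are assumptions, and practical systems typically weight the terms.

```python
import math
from collections import Counter


def cosine_similarity(doc, query):
    """Cosine of the angle between the term-count vectors of doc and query."""
    d = Counter(doc.lower().split())
    q = Counter(query.lower().split())
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(d[term] * q[term] for term in q)
    norm = (math.sqrt(sum(c * c for c in d.values()))
            * math.sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0


# Hypothetical documents scored against one query; a larger cosine means a
# smaller angle, hence a closer match.
docs = ["web search engine", "linear algebra", "web search"]
scores = [cosine_similarity(doc, "web search") for doc in docs]
print(scores)
```

A document sharing no terms with the query scores 0, while a document whose term vector points in the same direction as the query scores 1 (up to rounding).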


web graph the graph created by the Web’s hyperlink structure; the nodes in the graph are 
webpages and the arcs are hyperlinks 


web information retrieval the field that studies search within the world’s largest linked 
collection, the World Wide Web